PACE: A Partnership for an Advanced Computing Environment

January 25, 2022

[Postponed] Phoenix Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Michael Weiner @ 9:28 am

[Update 1/26/22 6:00 PM]

Due to complications associated with a similar repair on the Hive cluster this morning, we have decided to postpone replacement of the storage cable on the Phoenix cluster. This repair to the Phoenix Lustre project & scratch storage will now occur during our upcoming maintenance period.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 1/25/22 9:30 AM]

Summary: Phoenix project & scratch storage cable replacement, with a potential outage and temporarily decreased performance afterward

What’s happening and what are we doing: A cable connecting one enclosure of the Phoenix Lustre device, hosting project and scratch storage, to one of its controllers needs to be replaced. The replacement will begin around 12:00 noon on Wednesday (January 26). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

January 24, 2022

[Resolved] Hive Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Michael Weiner @ 1:25 pm

[Update 1/26/22 5:45 PM]

The PACE team, working with our support vendor, has restored the Hive GPFS project & scratch storage system, and the scheduler is again starting jobs.

We have followed up directly with all individuals with potentially impacted jobs from this morning. Please resubmit any jobs that failed.

Please accept our sincere apology for any inconvenience that this outage may have caused you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Update 1/26/22 10:40 AM]

The Hive GPFS storage system is down at this time, so Hive project (data) and scratch storage are unavailable. The PACE team is currently working to restore access. In order to avoid further disruption, we have paused the Hive scheduler, so no additional jobs will start. Jobs that were already running may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

We will update you when the system is restored.

[Original Post 1/24/22 1:25 PM]

Summary: Hive project & scratch storage cable replacement, with a potential outage and temporarily decreased performance afterward

What’s happening and what are we doing: A cable connecting one enclosure of the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced. The replacement will begin around 10:00 AM on Wednesday (January 26). After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

January 10, 2022

Operating System Upgrade to RHEL7.9

Filed under: Uncategorized — Michael Weiner @ 3:45 pm

[Update 1/10/22 3:45 PM]

Testflight environments are now available for you to prepare for the upgrade of PACE’s Phoenix, Hive, and Firebird clusters to the Red Hat Enterprise Linux (RHEL) 7.9 operating system from RHEL 7.6 during the February 9-11 maintenance period. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies. 

All PACE researchers are strongly encouraged to test all workflows they regularly run on PACE. Please conduct your testing at your earliest convenience to avoid delays to your research. An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software, including updates and modifications to our scientific software repository. PACE is providing updated modules for all our Message Passing Interface (MPI) options.

For details of what to test and how to access our Testflight-Coda (Phoenix) and Testflight-Hive environments, please visit our RHEL7.9 upgrade documentation.  

Please let us know if you encounter any issues with the upgraded environment. Our weekly PACE Consulting Sessions are a great opportunity to work with PACE’s facilitation team on your testing and upgrade preparation. Visit the schedule of upcoming sessions to find the next opportunity.  

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Original Post 12/7/21 3:30 PM]

Summary: Operating System Upgrade to RHEL7.9

What’s happening and what are we doing: PACE will upgrade our Phoenix, Hive, and Firebird clusters to the Red Hat Enterprise Linux (RHEL) 7.9 operating system from RHEL 7.6 during the February 9-11 maintenance period. The upgrade timing of the ICE clusters will be announced later. The required upgrade will improve the security of our clusters to comply with GT Cybersecurity policies and will also update our software repository.

PACE will provide researchers with access to a “testflight” environment in advance of the upgrade, allowing you the opportunity to ensure your software works in the new environment. More details will follow at a later time, including how to access the testing environment for each research cluster.

How does this impact me:

  • An OpenFabrics Enterprise Distribution (OFED) upgrade requires rebuilding our MPI software. PACE is providing updated modules for all of our Message Passing Interface (MPI) options and testing their compatibility with all software PACE installs in our scientific software repository.
  • Researchers who built their own software may need to rebuild it in the new environment and are encouraged to use the testflight environment to do so. Researchers who have contributed to PACE Community applications (Tier 3) should test their software and upgrade it if necessary to ensure continued functionality.
  • Researchers who have installed their own MPI code independent of PACE’s MPI installations will need to rebuild it in the new environment.
  • Due to the pending upgrade, software installation requests may be delayed in the coming months. Researchers are encouraged to submit a software request and discuss their specific needs with our software team research scientists. As our software team focuses on preparing the new environment and ensuring that existing software is compatible, requests for new software may take longer than usual to be fulfilled.

What we will continue to do: PACE will ensure that our scientific software repository is compatible with the new environment and will provide researchers with a testflight environment in advance of the migration, where you will be able to test the upgraded software or rebuild your own software. We will provide additional details as they become available.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
