PACE A Partnership for an Advanced Computing Environment

January 24, 2022

[Resolved] Hive Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Michael Weiner @ 1:25 pm

[Update 1/26/22 5:45 PM]

The PACE team, working with our support vendor, has restored the Hive GPFS project & scratch storage system, and the scheduler is again starting jobs.

We have followed up directly with all individuals with potentially impacted jobs from this morning. Please resubmit any jobs that failed.

Please accept our sincere apology for any inconvenience that this outage may have caused you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

[Update 1/26/22 10:40 AM]

The Hive GPFS storage system is down at this time, so Hive project (data) and scratch storage are unavailable. The PACE team is currently working to restore access. In order to avoid further disruption, we have paused the Hive scheduler, so no additional jobs will start. Jobs that were already running may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

We will update you when the system is restored.

[Original Post 1/24/22 1:25 PM]

Summary: A cable replacement on Hive project & scratch storage may cause an outage, followed by temporarily decreased performance.

What’s happening and what are we doing: A cable connecting one enclosure of the Hive GPFS device, which hosts project (data) and scratch storage, to one of its controllers needs to be replaced. The replacement will begin around 10:00 AM on Wednesday, January 26. After the replacement, pools will need to rebuild over the course of about a day.

How does this impact me: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so an outage remains possible. If storage becomes unavailable, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again.

What we will continue to do: PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
