[Update – 06/25 11:40PM]
The faulty storage controller cable on the Hive cluster was replaced this evening, and the controller was brought back online. Unfortunately, after the repair, the GPFS storage mounts became unavailable, which interrupted users’ running jobs. We briefly paused the scheduler while we restarted the GPFS services across the cluster. The storage mounts have been restored, and the scheduler has been resumed.
Jobs that were running or queued between about 7:00pm and 10:30pm today (6/25/2021) may have been interrupted. We recommend that users check on their jobs and resubmit them as needed. Please accept our sincerest apologies for this inconvenience.
We will continue to monitor the services and update as needed. If you have any questions, please contact us at pace-support@oit.gatech.edu.
[Original Message – 06/25 5:12PM]
Dear Hive Users,
We are reaching out to inform you that one of the storage controllers for the Hive cluster has a bad cable that needs to be replaced to ensure optimal performance and data integrity. We have the replacement cable on hand and are in the process of replacing it this evening, Friday 06/25/2021. This work will briefly impact storage performance, which users may experience as storage slowness, because we are routing all storage traffic to a secondary controller during this operation.
What’s happening and what we are doing: More specifically, PACE has observed a high failure rate among the disks in one of the enclosures attached to the storage controller with the bad cable. As a precaution, we will shut down that controller to unfail the disks and to ensure the data integrity of the system. We will replace the cable this evening while the controller is shut down. During this work, all storage traffic will be routed to a secondary controller that is fully operational. Given the expected load on the secondary controller, users will likely experience degraded storage performance.
How does this impact me: With only one storage controller in operation, users may experience storage slowness. In a highly unlikely scenario, this could cause storage downtime, which would impact all users’ running jobs; however, we do not anticipate any storage outage during this operation.
What we will continue to do: The PACE team will work on the cable replacement, restore the storage to optimal operation, and update the community as needed.
Please accept our sincere apologies for any inconvenience this may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
Best,
The PACE Team