[Update 2022/10/05, 12:40PM CST]
Work has been completed on one cable, and the associated systems connecting to the storage have been restored to normal. We will assess the stability of the system following the first cable replacement and schedule the second cable replacement for sometime next week.
[Update 2022/10/05, 10:10AM CST]
The work is still ongoing, and we are experiencing issues with one of the cable replacements. Although a redundant controller is still in place, we have already identified an impact on some users, whose data are not currently accessible. To minimize the impact on the system, we have decided to pause the scheduler to prevent new jobs from starting and crashing. Running jobs may be impacted by the storage outage.
Please be mindful about opening a new ticket with pace-support@oit.gatech.edu if your issue is storage-related.
[Original post]
Summary: Phoenix project & scratch storage cable replacement: potential outage and subsequent temporary decrease in performance
Details: Two cables connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, both need to be replaced, beginning around 10AM Wednesday, October 5th, 2022. The cables will be replaced one at a time, and the work is expected to take about 4 hours. After the replacement, pools will need to rebuild over the course of about a day.
Impact: Because a redundant controller remains in place while work is done on one cable, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance will be slower than usual for about a day following the repair as pools rebuild, and jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. If a loss of availability occurs, we will update you.
Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.