[Update 11/12/2023 11:15 PM]
The rebuild process completed on Sunday afternoon, and the system has returned to normal performance.
[Update 11/11/2023 6:40 PM]
Unfortunately, the rebuild is still in progress. Another drive has failed, which is further slowing the rebuild. We continue to monitor the situation closely.
[Update 11/10/2023 4:30 PM]
Summary: The project storage on Phoenix (/storage/coda1) is degraded due to hard drive failures. The data remains accessible, and the scheduler continues to accept and process jobs.
Details: Two hard drives that are part of the Phoenix storage space failed on the morning of Friday, November 10, 2023 (the first drive failed at 8:05 am and the second at 11:10 am). The operating system automatically activated spare drives and began rebuilding the pool. During this process, file read and write operations by Phoenix users will take longer than usual. The rebuild is expected to finish around 3 am on Saturday, November 11, 2023 (our original estimate of 7 pm on November 10 was too optimistic).
Impact: During the rebuild, file input/output operations are slower than usual. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.
Thank you for your patience as we work to resolve the problem.
[Update 11/8/2023 at 10:00 am]
Summary: Phoenix project storage experienced degraded performance overnight. PACE and our storage vendor made an additional configuration change this morning to restore performance.
Details: Following yesterday’s upgrade, Phoenix project storage became degraded overnight, though to a lesser extent than prior to the upgrade. Early this morning, the PACE team found that performance was slower than normal and began working with our storage vendor to identify the cause. We adjusted a parameter that handles migration of data between disk pools, and performance was restored.
Impact: Reading or writing files on the Phoenix project filesystem (coda1) may have been slower than usual last night and this morning. The earlier upgrade mitigated the problem, so the slowdown was less severe than before. Home and scratch directories were not affected.
Thank you for your patience as we completed this repair.
[Update 11/7/2023 at 2:53 pm]
Dear Phoenix users,
Summary: The hardware upgrade of the Phoenix cluster storage was completed successfully, and the cluster is back in operation. The scheduler has been unpaused, jobs that were placed on hold have resumed, and Globus transfer jobs have also resumed.
Details: To fix the slow response of the project storage, we had to take the Phoenix cluster offline from 8 am until 2:50 pm and upgrade several hardware, firmware, and software components on the /storage/coda1 file system. Engineers from our storage vendor worked with us throughout the upgrade and helped ensure that the storage is operating correctly.
Impact: Storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. If you have any issues, please contact us at pace-support@oit.gatech.edu.
Thank you for your patience!
The PACE team
[Update 11/3/2023 at 3:53 pm]
Dear Phoenix users,
Summary: Because of the significant impact the storage issues are having on the community, the Phoenix cluster will be taken offline on Tuesday, November 7, 2023, to fix problems with the project storage. The offline period will begin at 8 am and is expected to end by 8 pm.
Details: To implement the fixes to the firmware and software libraries for the storage appliance controllers, we need to pause the Phoenix cluster for 12 hours, starting at 8 am. Access to the /storage/coda1 file system will be interrupted while the work is in progress, and Globus transfer jobs will be paused while the fix is implemented. These fixes are expected to improve the performance of the project storage, which has been below the normal baseline since Monday, October 30.
The date of Tuesday, November 7, 2023 was selected to ensure that an engineer from our storage vendor is available to assist our team with the upgrade tasks and to monitor the health of the storage.
Impact: On November 7, 2023, no new jobs will start on the Phoenix cluster between 8 am and 8 pm. The job queue will resume after 8 pm. If your job fails after the cluster is released at 8 pm, please resubmit it. This affects only the Phoenix cluster; the other PACE clusters (Firebird, Hive, ICE, and Buzzard) will remain online and operate as usual.
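If it is helpful after the maintenance window, the minimal sketch below shows one way to list your jobs that ended abnormally on November 7 so they can be resubmitted. It assumes the Phoenix scheduler is Slurm with job accounting enabled (this announcement only refers to "the scheduler"); the time window and state list passed to sacct are illustrative and can be adjusted.

```python
# Minimal sketch, assuming the scheduler is Slurm with accounting enabled.
# Lists the current user's jobs that ended in a failed or cancelled state
# around the Nov 7 maintenance window, so they can be resubmitted.
import subprocess

result = subprocess.run(
    [
        "sacct",
        "--starttime=2023-11-07T08:00:00",
        "--endtime=2023-11-08T06:00:00",          # window is an assumption; widen as needed
        "--state=FAILED,NODE_FAIL,CANCELLED,TIMEOUT",
        "--format=JobID,JobName,State,Elapsed,End",
        "--parsable2",                             # pipe-delimited output, easy to parse
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```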
Again, we greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!
[Update 11/3/2023 1:53 pm]
Dear Phoenix users,
Summary: Storage performance on Phoenix coda1 project space continues to be degraded.
Details: Intermittent performance issues continue on the project file system on the Phoenix cluster, /storage/coda1. This was first observed on the afternoon of Monday, October 30.
Our storage vendor found versioning mismatches between firmware and software libraries on the storage appliance controllers that may be causing additional delays from data retransmissions. The mismatch was introduced when a hardware component was replaced during a scheduled maintenance period; that replacement required the rest of the system to be upgraded to the same versions, but this step was omitted from the installation and upgrade instructions.
We continue to work with the vendor to define a plan to update all of the components and correct the situation. We may need to pause cluster operations to avoid further issues while the fix is implemented; during this pause, jobs will be put on hold and will resume once the cluster is released. We are confirming the details with the vendor before scheduling the work, and we will announce when the fix will be applied and what to expect for cluster performance and operations.
Impact: Simple file operations, including listing the files in a directory, reading from a file, and saving a file, are intermittently taking longer than usual (at its worst, an operation expected to take a few milliseconds runs in about 10 seconds). This affects the /storage/coda1 project storage directories, but not scratch storage or any of the other PACE clusters.
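If you would like to check whether your own workflows are seeing this slowdown, the minimal sketch below (our illustration, not a PACE-provided tool; the directory path is a placeholder for your own project directory) times a few basic metadata and small-file operations and prints how long each took, so you can compare against the millisecond-scale timings expected under normal conditions.

```python
# Minimal sketch (not an official PACE tool): time a few basic file operations
# on a project directory to see whether they fall in the millisecond range or
# closer to the ~10-second delays described above.
import os
import time

PROJECT_DIR = "/storage/coda1"  # placeholder; substitute your own project directory

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")

# Directory listing (metadata-heavy operation).
timed("listdir", lambda: os.listdir(PROJECT_DIR))

# Stat the first 100 entries (another metadata operation).
timed("stat entries", lambda: [os.stat(os.path.join(PROJECT_DIR, name))
                               for name in os.listdir(PROJECT_DIR)[:100]])

# Write and read back a small temporary file, then remove it.
tmp_path = os.path.join(PROJECT_DIR, f"io_probe_{os.getpid()}.tmp")

def write_read():
    with open(tmp_path, "wb") as f:
        f.write(b"x" * 4096)
    with open(tmp_path, "rb") as f:
        f.read()
    os.remove(tmp_path)

timed("write/read 4 KiB", write_read)
```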
We greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!
Thank you and have a great day!
-The PACE Team