PACE A Partnership for an Advanced Computing Environment

February 19, 2024

Outage on Scratch Storage on the Phoenix Cluster

Filed under: Uncategorized — Michael Weiner @ 10:35 am

[Update 02/19/24 10:47 AM]

Summary: The Phoenix /storage/scratch1 file system is operational. The performance is stable. The scheduler has been un-paused, current jobs continue to run, and new jobs are being accepted. 

Details: The storage vendor provided us with a hot fix late Friday evening that was installed this morning on the Lustre appliance supporting /storage/scratch1. The performance test of the scratch file system after the upgrade was stable. We are releasing the cluster and the Slurm scheduler. The Open OnDemand services are back to normal. 

The cost of all jobs that ran between 6 PM on Wednesday, February 14, and 10 AM on Monday, February 19, will be refunded to the PIs’ accounts. 
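If you want to check which of your own jobs fall in that window, the sketch below may help. It is only an illustration and not the logic PACE uses to compute refunds: the function name is hypothetical, times are assumed to be local Atlanta time, and a job is assumed to count if any part of its run overlaps the window.

```python
from datetime import datetime

# Refund window as stated in the announcement (naive datetimes, assumed local time).
WINDOW_START = datetime(2024, 2, 14, 18, 0)  # 6 PM on Wednesday, February 14
WINDOW_END = datetime(2024, 2, 19, 10, 0)    # 10 AM on Monday, February 19


def ran_during_window(job_start, job_end=None):
    """Return True if the job's run interval overlaps the refund window.

    Hypothetical helper: pass job_end=None for a job that is still running.
    Treating any overlap as "ran during the window" is an assumption.
    """
    if job_end is None:
        job_end = datetime.max  # still running: treat the end as open
    # Two intervals overlap when each one starts before the other ends.
    return job_start < WINDOW_END and job_end > WINDOW_START
```

For example, ran_during_window(datetime(2024, 2, 17, 3, 0)) describes a job that started early Saturday morning and has not yet finished. Start and end times for your jobs would have to come from your own accounting records.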

Over the weekend, an automatic process accidentally resumed the scheduler, and some jobs started to run. If your job ran during the outage and used scratch, please consider re-running it from the beginning: some of its processes may have failed while trying to access the scratch file system before the hot fix was applied. The cost of jobs that were accidentally restarted during the outage will be refunded.

Impact: The storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. In case you have any issues, please contact us at pace-support@oit.gatech.edu.   

Thank you for your patience! 

[Update 02/16/24 05:58 PM]

PACE has decided to leave the Slurm scheduler paused, and no jobs will be accepted over the weekend. We will allow jobs that are currently running to continue, but those utilizing scratch may fail.

Keeping job scheduling paused over the weekend was a difficult decision, but we want to ensure that the issues with scratch storage do not impact the integrity of other components on Phoenix. 

We are not confident that functionality can be restored without further input from the storage vendor. As part of continuing the diagnostic process, we expect we will have no other option but to reboot the scratch storage system on Monday morning. As a result, any jobs still running at that point that utilize scratch storage will likely fail. We have continued to provide diagnostic data that the vendor will analyze during the weekend. We plan to provide an update on the state of the scratch storage by next Monday (2/19) at noon.

We will refund all jobs that ran from the start of the outage at 6:00 PM on Wednesday evening until performance is restored. 

The monthly deletion of old files in scratch, scheduled for Tuesday, February 20, has been canceled. All researchers who received deletion notifications for February will automatically be given a one-month extension. 

Finally, although you cannot schedule jobs, you may be able to log on to Phoenix to view or copy files. However, please be aware that you may see long delays with simple commands (ls/cd), when creating new files or folders, and when editing existing ones. We recommend that you avoid running file commands such as “ls” in your home (~) or scratch (~/scratch) directories, as doing so may cause your command prompt to stall. 

You may follow updates to this incident on the GT OIT Status page.  

We recognize the negative impact this storage disruption has on the research community, especially given that some of you may have research deadlines. Thank you for your patience as we continue working to fully restore the performance of the scratch storage system. If you have additional concerns, please email the ART Executive Director, Didier Contis, directly at didier.contis@gatech.edu.

[Update 02/16/24 02:59 PM]

Unfortunately, the scratch storage on the Phoenix cluster remains unstable. You may see long delays with simple commands (ls/cd), when creating new files or folders, and when editing existing ones. Jobs that are currently running from scratch might experience delays. We are continuing to work on resolving the issue and are in close communication with the storage vendor. The scheduler remains paused, and no new jobs are being accepted. We will provide an update on the state of the scratch storage by this evening. We sincerely apologize for the inconvenience that the current outage is causing you. 

Thank you for your patience.

[Update 02/16/24 09:15 AM]

Summary: Phoenix /storage/scratch1 file system continues to have issues for some users. The recommended procedure is to fail over the storage services to the high availability pair and reboot the affected component. This will require pausing the Phoenix scheduler. 

Details: After analyzing the storage logs, the vendor recommended that the affected component be rebooted, with all services and connections moved to its high availability pair. While the device restarts, the Phoenix scheduler will be paused. Running jobs will see a momentary pause in access to the /storage/scratch1 file system while the connections are moved to the redundant device. Once the primary device is up and running and all errors have cleared, the services will be switched back, and job scheduling will resume. 

We will start this procedure at 10:00 AM EST. Please wait for the all-clear message before starting additional jobs on the Phoenix cluster. 

Impact: Jobs on Phoenix will be paused during the appliance restart procedure; running jobs should continue with some delays while the connections are switched over. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 04:56 PM]

Summary: Phoenix /storage/scratch1 file system is now stable for most users. A small number of users are still experiencing issues. 

Details: While we continue working with the vendor to find the root cause of the issue, all diagnostic tests run throughout the day have been successful. However, a small number of users with jobs running from their scratch folders continue to notice slow access to their files. 

Please inform us if you are seeing degraded performance on our file systems. As mentioned, we continue the efforts to find a permanent solution. 

Impact: Access to /storage/scratch1 is normal for the majority of users; please let us know if you are still experiencing issues by emailing us at pace-support@oit.gatech.edu. OnDemand-Phoenix and the scheduler are working fine. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions. 

[Update 02/15/24 11:07 AM]

Summary: Phoenix /storage/scratch1 file system has intermittent issues. Jobs running from the scratch storage might be stuck. 

Details: Around 5:00 PM yesterday (February 14, 2024), the Lustre filesystem hosting /storage/scratch1 on the Phoenix cluster became inaccessible. We restarted the services at 8AM today (February 15, 2024) but some accessibility issues remain. The PACE team is investigating the cause and the storage vendor has been contacted. This may cause delays and timeouts on interactive sessions and running jobs. 

Impact: Access to /storage/scratch1 might be interrupted for some users. Running ‘ls’ on Phoenix home directories may hang as it attempts to resolve the symbolic link to the scratch directory. OnDemand-Phoenix was also affected; as of this writing, it is stable, and we continue to monitor it. Jobs using /storage/scratch1 may be stuck. The `pace-quota` command might hang while it checks scratch utilization, and it might show an incorrect balance. There is no impact to the Hive, ICE, Firebird, or Buzzard clusters. You may follow updates to this incident on the GT OIT Status page. 

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.
