PACE A Partnership for an Advanced Computing Environment

June 20, 2024

[OUTAGE] Phoenix Project Storage

Filed under: Uncategorized — Eric Coulter @ 1:36 pm

[Update 06/20/2024 04:58pm]

Dear Phoenix Users,

Summary: The Phoenix cluster is back online. The scheduler has been unpaused, jobs that were placed on hold have resumed, and the file system is ready for use.

Details: All the appliance components for Phoenix project storage were restarted, and file system consistency was confirmed. We’ll continue to monitor it and run additional consistency checks over the next few days.

Impact: If you were running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. We will be issuing refunds for all impacted jobs, so please reach out to pace-support@oit.gatech.edu if you have encountered any issues.

Thank you for your patience,

-The PACE Team

[Update 06/20/2024 01:36 pm]

Summary: The metadata servers for Phoenix project storage (/storage/coda1) are currently down due to degraded performance.

Details: During additional testing with the storage vendor, as part of the investigation into this morning's performance issues, it was necessary to take the storage fully offline rather than resume service.

Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler once we have verified that project storage is stable. Access to project storage (/storage/coda1) is currently interrupted; however, scratch storage (/storage/scratch1) is not affected. If you were running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. We will be issuing refunds for all impacted jobs as usual.

Only project storage on Phoenix is affected; storage on Hive, ICE, Buzzard and Firebird is working without issues.

Thank you for your patience as we work with our storage vendor to resolve this outage. We will continue to provide updates as work continues.

Please contact us at pace-support@oit.gatech.edu with any questions.

Degraded Phoenix Project Storage Performance

Filed under: Uncategorized — Jeff Valdez @ 10:29 am

Summary: The metadata servers for Phoenix project storage (/storage/coda1) restarted on their own, and one of them stopped responding, leading to degraded performance on the project storage file system.

Details: We have restarted the servers to restore access. Performance testing of the file system is ongoing. We will continue to monitor performance and work with the vendor to find the cause.

Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler as soon as we have verified that storage is stable. Access to project storage (/storage/coda1) might have been interrupted for some users. If you are running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. Only storage on Phoenix should be affected; storage on Hive, ICE, Buzzard and Firebird is working without issues.

June 18, 2024

IDEaS Storage Outage Resolved

Filed under: Uncategorized — Michael Weiner @ 10:13 am

Summary: PACE’s IDEaS storage was unreachable early this morning. Access was restored at approximately 9:00 AM.

Details: One controller on the IDEaS IntelliFlash storage became unresponsive, and the resource could not switch to the redundant controller. Rebooting both controllers restored access. PACE is working with our storage vendor to identify the cause.

Impact: IDEaS storage could not be reached during the outage from PACE and external mounts. Any jobs on Phoenix or Hive running on IDEaS storage would have failed. If you had a job on Phoenix running on IDEaS storage that failed, please email pace-support@oit.gatech.edu to request a refund.

Thank you for your patience as we resolved the issue this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

June 7, 2024

Hive Storage Maintenance

Filed under: Uncategorized — Jeff Valdez @ 4:21 pm

WHAT’S HAPPENING?

One of the storage controllers in use for Hive requires a hard drive replacement to restore the high availability of the device. The work is expected to take about 2 hours to complete.

WHEN IS IT HAPPENING?

Tuesday, June 11th, 2024, starting at 10 AM EDT.

WHY IS IT HAPPENING?

The failed drive limits the high availability of the controller.

WHO IS AFFECTED?

Users of the Hive storage system will notice decreased performance since all services will be switched over to a single controller. It is possible that access will be interrupted while the switch happens. 

WHAT DO YOU NEED TO DO?

During the hard drive replacement for the Hive cluster, one of the controllers will be shut down, and the redundant controller will take all the traffic. Data access should be preserved, and we do not expect downtime, but there have been cases in the past where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage can be accessed.
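If you want to confirm that storage can be accessed before resubmitting, one option is a quick reachability probe with a timeout, since a plain directory listing can hang indefinitely on an unresponsive mount. The following is only a minimal sketch, not a PACE-provided tool: the function name storage_reachable and the 10-second default timeout are assumptions made for illustration, and you would pass in your own storage directory.

```python
import os
import threading


def storage_reachable(path, timeout_s=10.0):
    """Return True if the first entry of `path` can be read within `timeout_s` seconds.

    Hypothetical helper for illustration. The probe runs in a daemon thread
    so that a hung mount cannot block the caller past the timeout.
    """
    result = {"ok": False}

    def probe():
        try:
            # Reading one directory entry forces a real metadata request
            # without listing the whole directory.
            with os.scandir(path) as entries:
                next(entries, None)
            result["ok"] = True
        except OSError:
            # Missing path, permission error, stale handle, and so on.
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # If the thread is still alive, the probe hung and storage is treated as unreachable.
    return result["ok"] and not worker.is_alive()
```

Calling storage_reachable with the path to your own directory returns True only when the directory responds within the timeout. A True result suggests the mount is answering again, but it does not guarantee full performance.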

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.
