PACE A Partnership for an Advanced Computing Environment

March 31, 2025

[Update] [storage] Phoenix Project storage degraded performance

Filed under: Uncategorized — rlombardi6 @ 12:51 pm

[Updated March 31, 2025 at 4:14pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

Jobs that were running during the performance degradation will not be charged against March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To expedite troubleshooting of the current Phoenix project filesystem issues and limit the impact on currently running jobs, we have implemented the following changes:

New Logins to Phoenix Login Nodes are Paused

We have blocked new logins to the Phoenix login nodes. Users who are currently logged in will be able to stay logged in.

Phoenix Jobs Prevented from Starting

Jobs that are in the queue but that have not yet started have been paused to prevent them from starting. These submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in and interact with only their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, Dropbox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team

[storage] Phoenix Project storage degraded performance

Filed under: Uncategorized — Eric Coulter @ 9:14 am

We are currently experiencing degraded performance on Phoenix Project storage. We are investigating with the vendor and will provide updates as we learn more.

Summary: Performance of Phoenix project storage is currently degraded.

Details: Two of our metadata servers (MDS) rebooted early Monday morning, March 31, and load averages are unusually high on one of them.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until we are able to mitigate the issue. Conda environments located in project storage may be very slow to load (even if the Python script to run is located elsewhere) or fail to activate, while attempts to view project storage files via the OnDemand web portal may time out.
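For researchers who want a rough sense of whether a directory is affected, the sketch below times a directory listing and a handful of file metadata lookups. It is an illustration only, not a PACE-provided tool: the function name `time_metadata_ops` and the `max_entries` limit are hypothetical, and the probe is deliberately small so that it adds very little load to an already busy storage system.

```python
import os
import sys
import time


def time_metadata_ops(directory, max_entries=20):
    """Return (seconds to list the directory, average seconds per lstat call).

    Hypothetical helper for illustration. It inspects at most max_entries
    entries so the probe itself stays lightweight.
    """
    start = time.perf_counter()
    with os.scandir(directory) as it:
        # zip() stops once range() is exhausted, so no more than
        # max_entries entries are read from the directory iterator.
        entries = [entry.path for _, entry in zip(range(max_entries), it)]
    list_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for path in entries:
        os.lstat(path)  # lstat does not follow symlinks, so broken links are fine
    stat_seconds = time.perf_counter() - start

    avg_stat_seconds = stat_seconds / len(entries) if entries else 0.0
    return list_seconds, avg_stat_seconds


if __name__ == "__main__":
    # The directory to probe is passed on the command line; defaults to ".".
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    list_s, avg_s = time_metadata_ops(target)
    print(f"listing: {list_s:.4f} s, average lstat: {avg_s:.6f} s")
```

Running the script once against a directory in project storage and once against a directory in home or scratch gives a simple point of comparison. The numbers are only indicative and will vary with caching and system load.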

March 19, 2025

Phoenix storage performance degraded

Filed under: Uncategorized — Michael Weiner @ 5:01 pm

[Update 3/21/25 12:30 PM]

Following the completion of the rebuild and copyback processes on the impacted redundant storage pool, Phoenix project storage performance has returned to normal. Please contact pace-support@oit.gatech.edu if you encounter any further issues.

[Original post 3/19/25 5:00 PM]

Summary: Performance of Phoenix project storage is currently degraded.

Details: Multiple redundant disks failed yesterday and today, and storage is slowed while the redundant pool rebuilds.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until the process is complete. Conda environments located in project storage may be very slow to load (even if the Python script to run is located elsewhere) or fail to activate, while attempts to view project storage files via the OnDemand web portal may time out.

Please visit https://status.gatech.edu for updates and contact pace-support@oit.gatech.edu with any questions.
