[Update 2/23/2022 3:00 PM]
Phoenix Lustre project and scratch storage performance has now recovered. You may now resume submitting jobs as normal. Thank you for your patience as we investigated the root cause.
This issue was caused by certain jobs engaging in heavy read/write activity on network storage. Thank you to the researchers we contacted for their cooperation in adjusting their jobs.
If your workflow requires extensive access to large files across multiple nodes, please contact us, and we will be happy to work with you to design a workflow that may speed up your research while ensuring network stability. PACE will also continue to work on improvements to our systems and monitoring.
If your work requires generating temporary files during a run, especially if they are large and/or numerous, you may benefit from using local disk on Phoenix compute nodes. Writing intermediate files to local storage avoids network latency and can speed up your calculations while lessening load on the system. Most Phoenix nodes have at least 1 TB of local NVMe storage available, while our SAS nodes have at least 7 TB of local storage. At the end of your job, you can transfer only the relevant output files to network storage (project or scratch).
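As an illustration, the pattern looks roughly like the Python sketch below. The TMPDIR environment variable and the destination path are assumptions for this example; substitute the local-scratch location and project or scratch directory appropriate to your job.

```python
import os
import shutil
from pathlib import Path

# Hypothetical local-scratch location; many schedulers expose one via
# $TMPDIR, but check the environment on your Phoenix compute node.
local_scratch = Path(os.environ.get("TMPDIR", "/tmp")) / "my_job"
local_scratch.mkdir(parents=True, exist_ok=True)

# Write large or numerous intermediate files to fast local disk
# instead of network storage.
for i in range(100):
    (local_scratch / f"intermediate_{i}.dat").write_bytes(b"...")  # placeholder work

# ... run the computation that consumes and produces these files ...

# At the end of the job, copy only the relevant output files back
# to network storage (project or scratch).
network_dest = Path("/path/to/your/project/results")  # hypothetical destination
network_dest.mkdir(parents=True, exist_ok=True)
for result in local_scratch.glob("final_*.out"):
    shutil.copy2(result, network_dest)

# Clean up local disk so the node is ready for the next job.
shutil.rmtree(local_scratch)
```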
We apologize for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.
[Update 2/22/2022 5:28 PM]
We are following up with an update on the Phoenix storage (project and scratch) slowness issues that have persisted since our previous report. We have engaged our storage and fabric vendors as we work to address this issue. Based on our current assessment, we have identified possibly problematic server racks, which we have taken offline. The scheduler remains online, but Phoenix is operating at reduced capacity, and we ask users to refrain from submitting new jobs unless they are urgent. We will continue to provide daily updates as we work to address this issue.
Please accept our sincere apologies for this inconvenience, and if you have any questions or concerns, please do not hesitate to reach out to us at pace-support@oit.gatech.edu.
[Update 2/17/2022 5:15 PM]
The PACE team and our storage vendor continue actively working to restore Lustre’s performance. We will provide updates as additional information becomes available. Please contact us at pace-support@oit.gatech.edu if you have any questions.
[Original Post 2/17/2022 10:30 AM]
Summary: Phoenix Lustre project & scratch storage degraded performance
What’s happening and what are we doing: Phoenix project and scratch storage have been performing more slowly than normal since late yesterday afternoon. We have determined that the Phoenix Lustre device, which hosts project and scratch storage, is experiencing errors, and we are working with our storage support vendor to restore performance.
How does this impact me: Researchers may experience slow performance using Phoenix project and scratch storage. This may include slowness in listing files in directories, reading files, or running jobs on Lustre storage. Home directories should not be impacted.
What we will continue to do: PACE is actively working, in coordination with our support vendor, to restore Lustre to full performance. We will update you as more information becomes available.
Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.