[Update – October 5, 2018] We worked with our vendor to address the issue affecting the network shared disk (NSD), which drastically reduced the performance of the pace1 file system when it was stressed by a large number of I/O-intensive jobs. On Thursday, the NSD was restored to normal operation, and our benchmarks indicate a successful resolution. As a precaution, we will continue to monitor the NSDs as user workloads return to normal.
[Original Post – October 3, 2018]
On Monday, October 1, we started to experience slowness on our parallel file system (pace1), which was associated with users’ I/O-intensive jobs. We have engaged the users responsible for the load. During this process, the stress on our storage and network allowed us to identify a bug in a network shared disk that caches data to improve read/write speeds. We have deployed a workaround, which has dramatically improved performance, and we are working with our vendor to fully resolve the issue.
The main symptom you may have experienced is slowness when navigating through your files. Your jobs should not have been impacted, other than by slower file access that may have resulted in longer execution times (i.e., wall-time).
We will update you once the issue is fully resolved in collaboration with our vendor. If you have any questions, please don’t hesitate to contact us at pace-support@oit.gatech.edu.