Update (3/22, 10:00AM): The initial findings point to hardware issues, but we don’t have a conclusive diagnosis yet. The vendor is collecting new logs to better understand the problem. We have been fixing some of the issues we found in the network and would like to know whether those fixes have made any difference. If you have opened tickets with us, please give us an update on your current experience: is it better, the same, or worse?
Data is everything when it comes to computing, and we understand how much these issues can affect your research progress. We are doing everything we can, with the support of the vendor, to resolve them as soon as possible.
Thank you for your feedback, cooperation and patience.
Update (3/21, 8:00PM): We are continuing to work with the vendor and have found several issues to fix, but the system is not fully stable yet. Please keep an eye on this post for more updates.
Original Post:
The storage slowness that was initially reported on the headnodes appears to be affecting some of the compute nodes as well. We are actively working to address this issue with guidance from the vendor.
If your jobs are impacted, please open a ticket with pace-support@oit.gatech.edu and report the job IDs. This will allow us to identify specific nodes that could be contributing to the problem.
The intermittent nature of the problem is making troubleshooting difficult. We appreciate your patience while we try to identify the cause.
Thank you.