Dear Researchers,
As we continue to monitor our network closely following the recent InfiniBand issues, we want to alert you to a brief network glitch this afternoon that affected the connection between GPFS and the compute nodes, as well as node-to-node communication.
What happened and what we did: At 12:45pm, we began experiencing connectivity issues between the two main InfiniBand switches that both GPFS and the compute nodes connect to. We quickly diagnosed the errors we observed, and by 1:55pm we had resolved the issues after rebooting one of the main switches.
Who is impacted: During this brief network glitch, users may have experienced slow reads/writes or errors on GPFS directories accessed from the compute nodes. Running MPI jobs may also have been affected. We encourage users to check on jobs that were running earlier this afternoon and resubmit any that were interrupted.
What we will continue to do: We will continue to monitor the network and report as needed. We appreciate your continued understanding and patience during these recent network interruptions. Please rest assured that we are doing everything we can to keep this network fabric operational.
If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.