[Resolved, January 28] One of our main Mellanox IB switches partially went down on Sunday morning, leaving a large number of compute nodes without access to the IB interconnect. Our system engineers resolved the issue at about 9:41am, and the IB switch is back online. As far as we know, the following queues were impacted: athena-intel, atlantis, atlas-6-sunge, atlas-intel, force-6, joe-intel, joe-test, novazohar, pace-devel, swarm, and zohar. We advise that you review your jobs from this weekend, as well as any currently running jobs, as this incident may have interrupted them. If your jobs failed with MPI errors or with errors writing files to /scratch/ or /data/[Your_Files], please resubmit them.
We will continue to monitor this switch and post updates as needed. If you experience any further issues, please contact pace-support@oit.gatech.edu.
Thank you, and we apologize for the inconvenience.