Hi folks,
We detected a problem with the InfiniBand fabric today and are working with the vendor to determine the best course of action. I wanted to give everybody a heads-up in case our temporary solution doesn’t hold.
This morning, the subnet manager was unable to establish new connections over the IB network. We do not believe that running jobs with existing connections were affected, but new MPI jobs using mvapich may have failed to start.
Unfortunately, the redundant management module we’ve already ordered for the big switch isn’t here yet, so we have configured a secondary software-based subnet manager. We hope this will be enough of a temporary solution to get us to the April maintenance window, when we can do a controlled downtime. If for some reason this solution doesn’t hold, our next step is to reboot the big switch, which will likely cause substantial disruption for existing InfiniBand jobs.
If you see MPI jobs fail from here on out, please let us know through the usual channel, pace-support@oit.gatech.edu.