PACE A Partnership for an Advanced Computing Environment

March 21, 2011

Infiniband Interruption

Filed under: tech support — Tags: — pm35 @ 9:35 pm

At approximately 10:15 am this morning, we received a trouble alert for the Infiniband fabric. We found the fabric subnet manager (SM) unresponsive, and restarting the system proved ineffective. A backup system was brought online, along with a secondary SM.

Service was restored by 11:15 am. A secondary system with an auto-failover SM is now running to protect against a repeat of this outage. A service request has been logged with the vendor. As always, please let us know if you observe any lingering difficulties.

March 8, 2011

Heads up – Infiniband trouble

Filed under: tech support — Tags: — admin @ 7:10 pm

Hi folks,

We’ve detected a problem with the Infiniband fabric today and are working with the vendor to determine the best course of action. I wanted to give everybody a heads up in case our temporary solution doesn’t hold.

What we saw this morning was that the subnet manager could not establish new connections over the IB network. We do not believe that running jobs with existing connections were affected, but it is possible that new MPI jobs using mvapich failed to start.

Unfortunately, the redundant management module we’ve already ordered for the big switch isn’t here yet, so we have configured a secondary software-based subnet manager in the hope that it will hold as a temporary solution until the April maintenance window, when we can schedule a controlled downtime. If this solution doesn’t hold, our next step is to reboot the big switch, which will likely cause substantial disruption for existing Infiniband jobs.

If you see MPI jobs fail from here on out, please let us know at the usual address, pace-support@oit.gatech.edu.
