PACE A Partnership for an Advanced Computing Environment

September 8, 2011

Problems Continued (8 Sept)

Filed under: tech support — pm35 @ 11:32 am

0720

Our replacement Force-10 switch is expected to be delivered by FedEx before noon. While waiting, we have continued working with Panasas overnight (between short sleeps), providing them with updated information. We’ll continue to update this blog as we get further information.

1500

The replacement switch was received and installed, but the problems remain. Panasas and Penguin are both on-site and are working, with remote support, to restore service. We have escalated this to the highest levels at both corporations and believe they are doing their best at this time to help restore service. We will update this blog and send emails once we have some positive news.

2100

After replacing the Force-10 switch, we continued to have network instability issues with the Panasas. After much sleuthing, we discovered an interaction between the type of cable used to connect the Panasas devices to the Force-10 switch and the switch itself. This was difficult to diagnose because the failures were intermittent: some cable paths would work fine for a long period, then partially corrupt packets for a short period, then be fine once again. The duration and severity of the corruption were sufficient to alert the Panasas software that there was a problem with one or more of the units, and it would attempt recovery. Part-way through the recovery process, the original data path would be fine again, but often another path would begin to fail in a similar way. This is unusual behaviour for any cable, and it had not been seen before with this cable type. The interaction is now logged in both the Panasas and Force-10 knowledge archives.

Once the cables were replaced, there remained some significant problems with the file systems themselves. Again, there is no loss or corruption of data; volumes of information were simply being moved automatically at too frequent an interval. Now that the Panasas realm has settled, it is a long task to re-certify the data partitions and ensure the data is correct. This process is under way now and will continue for some time overnight as it re-certifies the many TB of data.

If all goes according to plan, we will arrive in the morning to a still-stable Panasas storage system and will begin recovering cluster operation. We expect to have the cluster back in operation by early afternoon, or earlier if at all possible. Once we have successfully tested the cluster, we will restart all the scheduler services and announce it both here and via the mailing lists. Hopefully, we’ll have the good word in the morning.
