After days of continuous troubleshooting, we are happy to tell you that the clusters are finally back in a running state. You can now start submitting your jobs. All of your data are safe; however, the jobs that were running during the incident were killed and need to be restarted. We understand that this interruption must have adversely impacted your research, and we apologize for all the trouble. Please let us know (pace-support@oit.gatech.edu) if there is anything we can do to bring you up to speed once again.
A brief technical explanation of what happened:
At the heart of the problem was a set of fiber optic cables that interacted with the networking components in a way that intermittently interrupted communications among the Panasas storage modules. When a module stopped communicating, the remaining modules would begin moving the services it handled to a backup location. During that move, one of the other modules (including the one accepting the new service) would send or receive garbled information, causing either the move in progress to be recovered again or an additional service to be relocated, depending on which modules were involved. Interestingly, the cables themselves do not appear to be faulty; instead, they interacted badly with the networking components. Thus, when cables were replaced, or when switch ports or the network switch itself were swapped, the problems would appear “fixed” for a short while and then return before a full recovery could be completed. The three vendors involved provided access to their top support and engineering resources, none of whom had seen this kind of behavior before. Our experience has been entered into their knowledge bases for future diagnostics.
Thank you once again for your understanding and patience!
Regards,
PACE Team