
InfiniBand switch failure causing partial network and storage unavailability

Posted on Thursday, 25 May 2017.
We experienced an InfiniBand (IB) switch failure that affected several racks of nodes connected to the switch. The failure caused MPI jobs to crash and made GPFS unavailable.

The switch is now back online and it’s safe to submit new jobs.

If you use any of the queues listed below, please check your jobs and resubmit them if necessary. One indication of this issue is a “Stale file handle” error message in the job output or logs.

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar
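
If you have many jobs to check, a short script can scan the job output files for the “Stale file handle” message mentioned above. The following is only a minimal sketch, not an official tool: the function name find_affected_outputs is hypothetical, and the default file pattern "*.o*" assumes scheduler output files named like jobname.o12345, which may not match how your jobs write their output.

```python
from pathlib import Path

# Error text mentioned in this announcement.
MARKER = "Stale file handle"


def find_affected_outputs(log_dir, pattern="*.o*"):
    """Return sorted paths of files in log_dir whose text contains MARKER.

    The default pattern is an assumption about how job output files are
    named; pass a different glob pattern if your files are named otherwise.
    """
    affected = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has non-UTF-8 bytes.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            if any(MARKER in line for line in handle):
                affected.append(path)
    return affected
```

Call find_affected_outputs with the directory that holds your job output. Each path it returns belongs to a job that likely hit this issue and may need to be resubmitted. A job can also have crashed without logging this message, so an empty result does not guarantee that every job finished correctly.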
