PACE A Partnership for an Advanced Computing Environment

May 25, 2017

InfiniBand switch failure causing partial network and storage unavailability

Filed under: Uncategorized — Semir Sarajlic @ 8:47 pm
We experienced an InfiniBand (IB) switch failure that affected several racks of nodes connected to that switch. The failure caused MPI jobs to crash and GPFS to become unavailable.

The switch is now back online and it’s safe to submit new jobs.

If you use one or more of the queues listed below, please check your jobs and resubmit them if necessary. One indication of this issue is a “Stale file handle” error message in the job output or logs.

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar
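
If you have many jobs to check, the search for “Stale file handle” messages mentioned above can be scripted. The following is only a minimal sketch, not an official PACE tool: the function name find_stale_outputs and the default file pattern "*.o*" (scheduler-style output files such as myjob.o12345) are assumptions, so adjust the pattern to match how your own job output and log files are named.

```python
from pathlib import Path

# Error text that indicates a job was affected by the storage outage.
MARKER = "Stale file handle"


def find_stale_outputs(directory, pattern="*.o*"):
    """Return the files in `directory` matching `pattern` that contain MARKER.

    The default pattern is an assumption about how job output files
    are named; pass a different glob pattern if yours differ.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        try:
            # Replace undecodable bytes so binary noise does not abort the scan.
            text = path.read_text(errors="replace")
        except OSError:
            # Skip files that cannot be read (for example, permission problems).
            continue
        if MARKER in text:
            hits.append(path)
    return hits
```

Calling find_stale_outputs(".") from the directory that holds your job output returns the matching files. Any job whose output appears in that list is a candidate for resubmission.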
