We identified a problem with the way some nodes mount our main (GPFS) storage server, which is causing slow storage performance. The fix requires restarting the storage services on each affected node individually, while it is not running any jobs. For this reason, we have started draining (offlining) all affected nodes and are bringing each one back online as soon as its jobs are complete and the fix is applied.
Apart from slower storage access, this issue does not affect running jobs, but you will notice offline nodes in your queues until all affected nodes have been fixed.
It’s safe to continue submitting jobs, and there is no risk of data loss.
We apologize for the inconvenience and thank you for your cooperation.