The network switch that feeds most of the FoRCE cluster, along with a few nodes of the Atlantis and Joe clusters, exceeded its safe temperature threshold this afternoon and powered off. If you have jobs running on any of the affected nodes, please check on them.
We have some temporary cooling measures in place now, and we’ll address the long-term solution in the morning.
Affected nodes are:
- FoRCE cluster – iw-h41-*
- Atlantis cluster – iw-h41-31[g-j]
- Joe cluster – joe99 – joe104