Yesterday (10/26), in the early evening (around 4:50pm), one of our primary storage arrays suffered a serious crash (a kernel page fault, for those who want the detail), which took offline a significant share of the storage backing our VM infrastructure. Since most of the head nodes we run are VMs, the head nodes themselves started having trouble handling new job submissions and logins.
Please note that previously submitted jobs were not affected; only jobs in the process of being submitted between roughly 4:50pm yesterday and 8:30am this morning were impacted.
We have restored the array to service and will shortly open a ticket with the vendor to determine what happened on the machine and what remediations we can apply. We may also need to reboot the affected head nodes to return them to their proper state, but we are assessing where things stand before making that call.
UPDATE 1:
Unfortunately, upon review, we will have to restart the head node VMs. That process is starting immediately so that folks can submit jobs again as soon as possible.
UPDATE 2:
Working with the vendor, we have identified the likely cause of this problem. The permanent fix requires a reboot (which would be service-interrupting right now), so it will be applied during our January maintenance. Thankfully, a workaround for the bug that does not require a reboot is available and should keep the system stable until then. We have now put that workaround in place.