PACE A Partnership for an Advanced Computing Environment

October 27, 2014

Major Storage Issue (why were the head nodes unavailable?)

Filed under: tech support — Semir Sarajlic @ 1:32 pm

Yesterday (10/26) in the early evening (4:50pm), one of our primary storage units appears to have suffered a serious crash (a page fault in the kernel, if you want more detail), which took offline a good share of the storage allocated to supporting our VM infrastructure. Since most of the head nodes we run are in fact VMs, the head nodes themselves started having problems handling new job requests and allowing logins.

Please note that jobs which had already been submitted were not affected. The only jobs affected were those in the process of being submitted between about 4:50pm yesterday and 8:30am this morning.

We have restored functionality to this array and will shortly be submitting tickets with the vendor to evaluate what occurred on the machine and what remediations we can apply. We may also need to reboot the affected head nodes to get them back to their proper state, but we are evaluating where we are before making that call.

UPDATE 1:
Unfortunately, upon review, we will have to restart the head node VMs. That process will start immediately so that folks can submit jobs as soon as possible.

UPDATE 2:
With the vendor's engagement, we have identified the likely cause of this problem. The permanent fix requires a reboot, which would interrupt service right now, so it will be applied during our January Maintenance. Thankfully, a work-around for the bug is available that we can apply without a reboot, and it should keep the system stable until then. At this time, we have enacted that work-around.
