PACE: A Partnership for an Advanced Computing Environment

March 26, 2012

Regarding the job scheduler problems over the weekend

Filed under: tech support — Semir Sarajlic @ 4:24 pm

We experienced a major problem with one of our file servers over the weekend, which caused some of your jobs to fail. We would like to apologize for this inconvenience and provide you with more details on the issue.

In a nutshell, the management blade of the file server that we use for scratch space (iw-scratch) crashed, for a reason that we are still investigating. This system has a failover mechanism that allows another blade to take over so that operations can continue. Therefore, you were still able to see your files and use the software stack hosted on this file server.

The node that runs the Moab server (our job scheduler), on the other hand, mounts this file server through a different mechanism that relies on a static IP address. After the new blade took over, the Moab node kept trying to mount iw-scratch using the IP address of the failed blade, and those attempts failed.

As a result, some jobs failed with messages similar to “file not found”. The problem also left the Moab server unresponsive until we rebooted it on Saturday night. Even after the reboot, some problems persisted until we fixed the server this morning. We will keep you updated as we find out more about the nature of the problem. We are also in contact with the vendor to prevent this from happening again.

Thank you once again for your understanding and patience. Please contact us at pace-support@oit.gatech.edu with any questions or concerns.
