PACE A Partnership for an Advanced Computing Environment

September 2, 2017

GPFS problem (resolved)

Filed under: Uncategorized — admin @ 12:52 am

This was much ado about nothing.  Running jobs continued to execute normally through this event, and no data was at risk.  What did happen is that jobs that could potentially have started were delayed.

A longer explanation –

We have monitoring agents that prevent jobs from starting if they detect a potential problem with the system.  The idea is to avoid starting a job if there’s a known reason that would cause a crash.  During our last maintenance period, we brought a new DDN storage system online and configured these agents to watch it for issues.  It did develop an issue, the monitoring agents flagged it and took nodes offline to new jobs.  However, we have yet to put any production workloads on this new storage so no running jobs were affected.

At the moment, we’re pushing out a change to the monitoring agents to ignore the new storage.  As this finishes rolling out, compute nodes will come online and resume normal processing.  We’re also working with DDN to address the issue on the new storage system.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress