
GPFS problem (resolved)

This entry was posted on Saturday, 2 September, 2017.

This was much ado about nothing.  Running jobs continued to execute normally through this event, and no data was at risk.  The only effect was that queued jobs which could otherwise have started were delayed.

A longer explanation:

We have monitoring agents that prevent jobs from starting if they detect a potential problem with the system.  The idea is to avoid starting a job when there’s a known condition that would cause it to crash.  During our last maintenance period, we brought a new DDN storage system online and configured these agents to watch it for issues.  It did develop an issue; the monitoring agents flagged it and marked nodes offline for new jobs.  However, we have not yet put any production workloads on this new storage, so no running jobs were affected.

At the moment, we’re pushing out a change to the monitoring agents so that they ignore the new storage.  As this finishes rolling out, compute nodes will come back online and resume normal processing.  We’re also working with DDN to address the issue on the new storage system.
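
For the curious, the logic behind this kind of gate can be sketched in a few lines of Python.  This is only an illustration and not the actual agent code: the names StorageCheck, node_accepts_new_jobs, "production_gpfs" and "new_ddn" are made up for the example, and the real health probes are not shown.  It models only the decision to accept new jobs; jobs that are already running are never touched by it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageCheck:
    """Result of one storage health probe (hypothetical structure)."""
    name: str      # illustrative label for a storage system
    healthy: bool  # outcome of whatever probe the agent runs


def node_accepts_new_jobs(checks, ignored=frozenset()):
    """Return True if the node should accept new jobs.

    A node is closed to new jobs as soon as any monitored storage
    system reports a problem, unless that system is in `ignored`.
    """
    return all(c.healthy for c in checks if c.name not in ignored)


# Assumed state during the event: production storage fine, new storage not.
checks = [
    StorageCheck("production_gpfs", healthy=True),
    StorageCheck("new_ddn", healthy=False),
]

# Before the change: the unhealthy new storage closes the node to new jobs.
before = node_accepts_new_jobs(checks)

# After the change: the new storage is ignored, so the node reopens.
after = node_accepts_new_jobs(checks, ignored={"new_ddn"})

print(before, after)
```

In this sketch the fix amounts to adding one name to the ignore set, which is why nodes return to service as soon as the updated configuration reaches them.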
