PACE A Partnership for an Advanced Computing Environment

November 21, 2017

Systematic offlining of PACE nodes to address storage slowness

Filed under: Uncategorized — Semir Sarajlic @ 2:34 pm

We identified a problem with the way some nodes mount our main (GPFS) storage server, which is causing slow storage performance. The fix requires restarting the storage services on each affected node individually, while it is not running any jobs. For this reason, we have started draining (offlining) all affected nodes. Each node will be brought back online as soon as its jobs are complete and the fix is applied.

Apart from slower storage access, this issue does not affect running jobs. You will, however, notice offline nodes in your queues until we have addressed all affected nodes.

It’s safe to continue submitting jobs and there is no risk of data loss.

We are sorry for this inconvenience and thank you for your cooperation.

November 4, 2017

PACE clusters ready for research

Filed under: Uncategorized — Semir Sarajlic @ 2:49 am

Our November 2017 maintenance period is now complete, far ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data are available. As usual, there are some straggling nodes that we will address over the coming days.

Our next maintenance period is scheduled for Thursday, February 8 through Saturday, February 10, 2018.

Storage
– Nearly a petabyte of data was migrated to the new DDN/GPFS storage device. While this will provide a more performant, expandable, and supportable storage platform, it requires changes to path names. We have adjusted the symbolic links in home directories (e.g. ~/data) to point to the new locations. Please continue to use these names wherever possible. To minimize disruption, we have also put a temporary redirection in place so that the old names will continue to work. We intend to remove this redirection during our next maintenance period, and will proactively identify and assist users who are still using the deprecated path names.
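
For scripts that currently hard-code a storage location, one way to follow the advice above is to build paths from the home-directory symbolic link. The following is only a minimal sketch: the function name `project_data_path` and the example subdirectory names are hypothetical, and it assumes the `~/data` link mentioned above exists in your home directory.

```python
from pathlib import Path


def project_data_path(*parts, home=None):
    """Build a path under the ~/data symbolic link.

    Going through the link means the script keeps working when the
    storage location behind the link changes.
    `home` is only an override for testing; by default the current
    user's home directory is used.
    """
    base = Path(home) if home is not None else Path.home()
    return base.joinpath("data", *parts)


if __name__ == "__main__":
    # "my_project" and "results.csv" are placeholder names.
    path = project_data_path("my_project", "results.csv")
    # resolve() shows where the link currently points.
    print(f"{path} -> {path.resolve()}")
```

Printing the resolved path is a quick way to see which storage location a link currently points to without writing that location into the script.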

Schedulers
– The nvidia-gpu and gpu-recent queues have been consolidated into a new force-gpu queue.  Please use the new queue name going forward.  PACE staff will proactively identify and assist users using the deprecated queue names.
– The semap-6 queue has been moved to an alternate scheduler server.  No user action is required.
– The Joe cluster has been moved into the shared partition. Joe cluster users now have access to idle cycles in the shared partition and, in turn, offer the idle cycles of their cluster for use by others.
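
To find job scripts that still request the consolidated GPU queues, something like the following sketch may help. It assumes that job scripts select a queue with a PBS-style `#PBS -q <queue>` directive, which is an assumption and not stated in this post. The function name `update_queue_names` is hypothetical.

```python
import re

# Assumption: the queue is requested on a line of the form "#PBS -q <queue>".
# Only the two deprecated queue names from the announcement are matched.
QUEUE_DIRECTIVE = re.compile(
    r"^([ \t]*#PBS[ \t]+-q[ \t]+)(?:nvidia-gpu|gpu-recent)(?=\s|$)",
    re.MULTILINE,
)


def update_queue_names(script_text):
    """Replace deprecated GPU queue names in queue directives with force-gpu.

    Returns a tuple (new_text, number_of_replacements). Lines that merely
    mention the old names outside a queue directive are left untouched.
    """
    return QUEUE_DIRECTIVE.subn(r"\g<1>force-gpu", script_text)


if __name__ == "__main__":
    example = "#!/bin/bash\n#PBS -N demo\n#PBS -q gpu-recent\n./run.sh\n"
    new_text, count = update_queue_names(example)
    print(f"{count} directive(s) updated")
    print(new_text)
```

The function works on text only and does not write files, so the result can be reviewed before any job script is changed.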

ITAR / NIST800-171 environment
– Planned tasks are complete. No user action is required.

Power and Network
– Planned tasks are complete. No user action is required.
