
PACE quarterly maintenance – (Aug 10-12, 2017)

This entry was posted on Thursday, 3 August, 2017.

Dear PACE users,

PACE clusters and systems will be offline from 6am on Thursday, Aug 10 through the end of Saturday, Aug 12. Jobs with long walltimes will be held by the scheduler so that they are not killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
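To make the walltime rule concrete, here is a minimal sketch of how you might check whether a job is likely to be held. It assumes that a job is held when its start time plus its requested walltime runs past the start of maintenance; the scheduler's actual logic is not described in this post, and the function names below are hypothetical.

```python
from datetime import datetime, timedelta

# Maintenance start taken from this announcement: 6am on Thursday, Aug 10, 2017.
MAINTENANCE_START = datetime(2017, 8, 10, 6, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Convert an HH:MM:SS walltime string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def would_be_held(start_time: datetime, walltime: str) -> bool:
    """Return True if a job starting at start_time could still be running
    when maintenance begins (assumed rule, not the scheduler's real code)."""
    return start_time + parse_walltime(walltime) > MAINTENANCE_START


# A job starting at noon on Aug 9 with a 24-hour walltime would overlap
# the maintenance window; the same job with a 12-hour walltime would not.
print(would_be_held(datetime(2017, 8, 9, 12, 0), "24:00:00"))
print(would_be_held(datetime(2017, 8, 9, 12, 0), "12:00:00"))
```

Under this assumption, requesting a shorter walltime that ends before 6am on Aug 10 would let a job run ahead of the maintenance rather than wait for it to finish.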

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Storage (no user action needed)

We are working with the vendor to reconfigure our multiple GPFS filesystems for performance fine-tuning and better support for large scale deployments. Some of these configurations will be applied on the maintenance day because they require downtime.

Network (no user action needed)

The PACE InfiniBand (IB) network requires some configuration changes to handle the rapid growth in the number of PACE nodes. We have identified several configuration parameters that may reduce how often nodes lose their connectivity to GPFS (which relies on the IB network), a problem that causes intermittent job crashes.

Schedulers (postponed, no user action needed)

We have communicated our plans to upgrade the schedulers on several occasions, but skipped this task on past maintenance days because of bugs that we had uncovered. Despite promising progress by the vendor, these bugs are not yet fully resolved and tested. For this reason, we have decided to postpone the upgrade once again and keep the current versions until a bug-free and well-tested version is available.

Scheduler-based monitoring and analysis (no user action needed)

PACE research scientists have started a collaboration with the Texas Advanced Computing Center (TACC) to develop a new tool that analyzes scheduler logs to gain insights into usage trends. For more details, please see https://doi.org/10.1145/3093338.3093351

This tool relies heavily on a widely used utility named ‘PBSTools’, which is developed by the Ohio Supercomputer Center. Our current installation of PBSTools is old, buggy and very slow. We will upgrade this tool and its database on the maintenance day to ensure that no job information is lost during the transition.

Power Work (no user action needed)

PACE and the OIT Operations team will perform some work in the datacenter to improve electrical capacity and balance. This includes moving and/or replacing some of the in-rack Power Distribution Units (PDUs).

Software Repository (no user action needed)

Our secure cluster, which conforms to NIST SP 800-171, needs a dedicated copy of the PACE software repository. We have already completed an initial replication of the files and will simply point the nodes to the replica during the maintenance window.
