PACE A Partnership for an Advanced Computing Environment

May 25, 2017

InfiniBand switch failure causing partial network and storage unavailability

Filed under: Uncategorized — Semir Sarajlic @ 8:47 pm
We experienced an InfiniBand (IB) switch failure that affected several racks of nodes connected to that switch. The failure caused MPI jobs to crash and made GPFS storage unavailable.

The switch is now back online and it’s safe to submit new jobs.

If you use any of the queues listed below, please check your jobs and resubmit them if necessary. One indication of this issue is a “Stale file handle” error message in the job output or logs.

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar
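
One way to check many jobs at once is to search their output files for the error text mentioned above. The following is a minimal Python sketch, not a PACE-provided tool. The function name and the file-name patterns (*.out, *.err, *.log) are assumptions for illustration, so adjust them to match how your jobs name their output.

```python
from pathlib import Path

# Error text quoted in the announcement; matched case-insensitively.
MARKER = "stale file handle"


def find_affected_outputs(directory, patterns=("*.out", "*.err", "*.log")):
    """Return a sorted list of files under `directory` that contain MARKER.

    The default patterns are assumed file names for job output and logs,
    not a PACE convention.
    """
    root = Path(directory)
    hits = set()
    for pattern in patterns:
        for path in root.rglob(pattern):
            if not path.is_file():
                continue
            try:
                with path.open("r", errors="replace") as handle:
                    if any(MARKER in line.lower() for line in handle):
                        hits.add(path)
            except OSError:
                # Skip files that cannot be read instead of aborting the scan.
                continue
    return sorted(hits)
```

Calling find_affected_outputs on the directory that holds your job output returns the files to inspect. A match is only a hint that the job was affected, so please still review the job before resubmitting it.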

May 12, 2017

PACE clusters ready for research

Filed under: Uncategorized — admin @ 9:58 pm

Our May 2017 maintenance period is now complete, far ahead of schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and data is available. As usual, there are some straggling nodes that we will address over the coming days.

Our next maintenance period is scheduled for Thursday, August 10 through Saturday, August 12, 2017.

New operating system kernel

  • All compute, interactive, and head nodes have received the updated kernel. No user action needed.

DDN firmware updates

  • This update brought the low-level firmware on the drives up to date, as recommended by DDN. No user action needed.

Networking

  • DNS/DHCP and firewall updates per vendor recommendation applied by OIT Network Engineering.
  • IP address reassignments for some clusters completed. No user action needed.

Electrical

  • Power distribution repairs completed by OIT Operations. No user action needed.

May 8, 2017

PACE quarterly maintenance – May 11, 2017

Filed under: Uncategorized — Semir Sarajlic @ 5:15 pm

PACE clusters and systems will be taken offline at 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs with long walltimes will be held by the scheduler to prevent them from being killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
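
To estimate whether a job is likely to be held, compare its start time plus requested walltime against the 6am May 11 start of maintenance. The Python sketch below only illustrates that arithmetic. The function names and the HH:MM:SS / DD:HH:MM:SS walltime format are assumptions, and the scheduler's actual decision logic may differ.

```python
from datetime import datetime, timedelta

# Start of the maintenance window as announced: 6am on Thursday, May 11, 2017.
MAINTENANCE_START = datetime(2017, 5, 11, 6, 0)


def parse_walltime(text):
    """Convert 'HH:MM:SS' or 'DD:HH:MM:SS' (assumed formats) to a timedelta."""
    parts = [int(piece) for piece in text.split(":")]
    if len(parts) == 3:
        days = 0
        hours, minutes, seconds = parts
    elif len(parts) == 4:
        days, hours, minutes, seconds = parts
    else:
        raise ValueError(f"unrecognized walltime: {text!r}")
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def overlaps_maintenance(start_time, walltime, maintenance_start=MAINTENANCE_START):
    """True if a job starting at `start_time` could still be running
    when the maintenance window begins."""
    return start_time + parse_walltime(walltime) > maintenance_start
```

For example, a job that starts at noon on May 10 with a 12-hour walltime would finish at midnight, before the window opens, while the same job with a 24-hour walltime would run past 6am on May 11 and could be held.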

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Systems

  • We will deploy a recompiled kernel that is identical to the current version except for a patch that addresses the Dirty COW vulnerability (CVE-2016-5195). The mitigation currently in place prevents the use of debuggers and profilers (e.g. gdb, strace, Allinea DDT). After the patched kernel is deployed, these tools will once again be available on all nodes. Please let us know if you still have problems debugging or profiling your codes after the maintenance day.

Storage

  • Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)

Network

  • Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
  • Software upgrades to the PACE firewall appliance to address a known bug
  • New subnets and re-assignment of IP addresses for some of the clusters

Power

  • Fixes for PDU problems that are impacting 3 nodes in rack c29

The date for the next maintenance day is not certain yet, but we will announce it as soon as we have it.
