PACE clusters and systems will be offline from 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs whose walltimes would extend into the maintenance window will be held by the scheduler so that they are not killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
The planned improvements are mostly transparent to users and require no user action before or after the maintenance.
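To judge whether a job submitted before the maintenance is likely to be held, the rough check below compares the job's requested walltime with the time left until the shutdown. This is only a sketch: the function name `would_be_held` is hypothetical, the year in `MAINTENANCE_START` is an assumption (the announcement gives only the day and time), and the scheduler's actual hold rule may differ.

```python
from datetime import datetime, timedelta

# Assumed for illustration: the announcement states 6am on May 11 but not the year.
MAINTENANCE_START = datetime(2017, 5, 11, 6, 0)


def would_be_held(submit_time, walltime, maintenance_start=MAINTENANCE_START):
    """Return True if a job starting at submit_time with the requested
    walltime could still be running when the maintenance begins."""
    starts_before = submit_time < maintenance_start
    ends_after = submit_time + walltime > maintenance_start
    return starts_before and ends_after


if __name__ == "__main__":
    evening_before = datetime(2017, 5, 10, 18, 0)
    # A 24-hour request overlaps the shutdown; an 8-hour request does not.
    print(would_be_held(evening_before, timedelta(hours=24)))
    print(would_be_held(evening_before, timedelta(hours=8)))
```

If a job can finish in less time than remains before the shutdown, requesting a correspondingly shorter walltime should allow it to run before the maintenance instead of being held.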
Systems
- We will deploy a recompiled kernel that’s identical to the current version except for a patch that addresses the Dirty COW vulnerability (CVE-2016-5195). The mitigation currently in place prevents the use of debuggers and profilers (e.g. gdb, strace, Allinea DDT). After the patched kernel is deployed, these tools will once again be available on all nodes. Please let us know if you continue to have problems debugging or profiling your codes after the maintenance day.
Storage
- Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)
Network
- Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
- Software upgrades to the PACE firewall appliance to address a known bug
- New subnets and re-assignment of IP addresses for some of the clusters
Power
- Fixes for PDU problems that are affecting 3 nodes in the c29 rack
The date of the next maintenance day is not yet certain; we will announce it as soon as it is set.