Our April maintenance window is now complete. As usual, a number of compute nodes still need to be brought back online, but we are substantially online and processing jobs at this point.
We did run into an unanticipated maintenance item with the GPFS storage, though no data has been lost. As we added disks to the DDN storage system, we neglected to perform a required rebalancing operation that spreads load across all the disks. The rebalancing operation has been running for most of our maintenance window, but the task is large and progress has been much slower than expected. We will continue the rebalancing during off-peak times to limit the impact on storage performance as best we can.
Removal of /nv/gpfs-gateway-* mount points
Task complete as described. The system should no longer generate these paths. If you have used these paths explicitly, your jobs will likely fail. Please continue to use paths relative to your home directory (e.g., ~/data, ~/scratch) for future compatibility.
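If you are unsure whether any of your job scripts still reference the removed paths, a quick scan along the following lines may help. This is a minimal sketch rather than a supported tool: the function names are hypothetical, and it assumes that matching the literal prefix /nv/gpfs-gateway- is enough to catch the affected paths in your scripts.

```python
import re
import sys
from pathlib import Path

# Assumption: any hard-coded path starting with this prefix is affected.
REMOVED_PREFIX = re.compile(r"/nv/gpfs-gateway-[^\s'\"]*")


def find_removed_paths(text):
    """Return (line_number, path) pairs for each removed-prefix path in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in REMOVED_PREFIX.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_file(path):
    """Scan one file, tolerating undecodable bytes instead of failing."""
    return find_removed_paths(Path(path).read_text(errors="replace"))


if __name__ == "__main__":
    # Usage: python check_paths.py job1.pbs job2.sh ...
    for name in sys.argv[1:]:
        for lineno, found in scan_file(name):
            print(f"{name}:{lineno}: {found}")
```

Any line reported by the scan is a candidate for replacement with a path relative to your home directory, such as ~/data or ~/scratch.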
New GPFS gateway
Task complete as described.
GPFS server and client tuning
Task complete as described.
Decommission old Panasas scratch
Task complete as described. Paths starting with /panfs no longer work. Everybody should have been transitioned to the new scratch long ago, so we do not expect anybody to have issues here.
Enabling debug mode
Task complete as described. You may see additional warning messages if your code is not well behaved with regard to memory utilization. This is a hint that you may have a bug.
Removal of compatibility links for migrated storage
Task complete as described. Affected users (Prometheus and CEE clusters) were contacted before maintenance day. No user impact is expected, but please send in a ticket if you think there is a problem.
Scheduler updates
Task complete as described.
Networking improvements
Task complete as described.
Diskless node transition
Task complete as described.
Security updates
Task complete as described.