2014-07-15 at 6am: Maintenance has begun
Hi folks,
It is time again for our quarterly maintenance. We have a bit of a situation this time around and will need to extend our activities into a third day – starting at 6:00am Tuesday, July 15 and ending at or before 11:59pm Thursday, July 17. This is a one-time event, and I do not expect to move to three-day maintenance as a norm. Continue reading below for more details.
Over the years, we’ve grown quite a bit and filled up one side of our big InfiniBand switch. This is a good thing! The good news is that there is plenty of expansion room on the other side of the switch. The bad news is that we didn’t leave a hole in the raised floor to get the cables to the other side of the switch. To fix this, and to install all of the equipment that was ordered in June, we need to move the rack that contains the switch as well as some HVAC units on either side. That means unplugging a couple hundred InfiniBand connections and some Ethernet fiber. Facilities will be on hand to handle the HVAC. After all the racks are moved, we’ll swap in some new raised-floor tiles and put everything back together. This is a fair bit of work, and is the impetus for the extra day.
In addition, we will be upgrading all of the Red Hat 6 compute nodes and login nodes from RHEL 6.3 to RHEL 6.5 – this represents nearly all of the clusters that PACE manages. This image has been running on the TestFlight cluster for some time now – if you haven’t taken the opportunity to test your codes there, please do so. This important update contains some critical security fixes to go along with the usual assortment of bug fixes.
We are also deploying updates to the scheduler prologue and epilogue scripts to more effectively combat “leftover” processes from jobs that don’t completely clean up after themselves. This should help reduce situations where jobs aren’t started because compute nodes incorrectly appear busy to the scheduler.
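As a rough illustration of the idea only (this is not the actual PACE prologue or epilogue script), a cleanup step could compare the processes on a node against the users who still have jobs running there. The `Proc` type, the `find_leftovers` function, and the `system_users` default below are all hypothetical names made up for this sketch:

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    """A simplified view of one process on a compute node (hypothetical)."""
    pid: int
    user: str
    name: str


def find_leftovers(
    procs: Iterable[Proc],
    active_users: Set[str],
    system_users: FrozenSet[str] = frozenset({"root"}),
) -> List[Proc]:
    """Return processes whose owner has no running job on this node.

    Processes owned by system accounts are never reported, since they
    are not part of any user job.
    """
    return [
        p for p in procs
        if p.user not in active_users and p.user not in system_users
    ]


# Example: alice still has a job on the node, bob's job has ended.
procs = [
    Proc(1, "root", "init"),
    Proc(200, "alice", "a.out"),
    Proc(300, "bob", "mpirun"),
]
leftovers = find_leftovers(procs, active_users={"alice"})
```

In this sketch, only bob's `mpirun` process would be flagged. A real script would also need to gather the process list and the set of active users from the node and the scheduler, and then decide how to terminate what it finds.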
We will also be relocating some storage servers to prepare for incoming equipment. There should be no noticeable impact from this move, but just in case, the following filesystems are involved:
- /nv/pase1
- /nv/pb2
- /nv/pc6
- /nv/pcoc1
- /nv/pface1
- /nv/pmart1
- /nv/pmeg1
- /nv/pmicro1
- /nv/py2
Among other things, these moves will pave the way for another capacity expansion of our DDN project storage, as well as a new scratch filesystem. Stay tuned for more details on the new scratch, but we are planning a significant capacity and performance increase. The projected timeframe is to go into limited production during our October maintenance window and ramp up from there.
We will also be implementing some performance tuning changes for the Ethernet networks that should primarily benefit the non-GPFS project storage.
The /nv/pase1 filesystem will be moved back to its old storage server, which is now repaired and tested.
The tardis-6 head node will have some additional memory allocated.
And finally, some other minor changes:
- Firmware updates to our DDN/GPFS storage, as recommended by DDN, as well as installation of additional disks for increased capacity.
- The OIT Network Backbone team will be upgrading the appliances that provide DNS & DHCP services for PACE. This should have negligible impact on us, as they have already rolled out new appliances for most of campus.
- Replacement of a fuse for the in-rack power distribution in rack H33.
— Neil Bright