The time has again come to discuss our upcoming quarterly maintenance. As you may recall, our activities lasted well into the night during our last maintenance in July. Since then, I’ve been talking with various stakeholders and the HPC Faculty Governance Committee. Starting with our October maintenance and going forward, we will be extending our quarterly maintenance periods from one day to two. In October, this will be Tuesday the 15th and Wednesday the 16th. I’ve updated our schedule on the PACE web page. Upcoming maintenance periods for January, April, and July will be posted shortly.
Please continue reading below; there are a couple of items that will require user action after the maintenance.
Scheduled fixes for October include the following:
- Project storage: We will deploy fixes for the remaining storage servers. This will complete the rollout of fixes initiated during our last maintenance period and bring these servers to the most stable Solaris platform known to us at this point. Between these fixes and the networking fixes below, we believe we will resolve most, if not all, of the storage issues we’ve been having lately.
- Networking updates: We have three categories of work here. The first is to upgrade the firmware on all of the switches in our gigabit Ethernet fabric, which should solve the switch rebooting problem. The second is to finish the Ethernet redundancy work we didn’t complete in July. While this redundancy work will not ensure that individual compute nodes won’t suffer an Ethernet failure, it does nearly eliminate single points of failure in the network itself. No user-visible impact is expected. Third, we plan to update the firmware on some of our smaller InfiniBand switches to bring them in line with the version of software we’re running elsewhere.
- Moab/Torque job scheduler: To mitigate some response issues with the scheduler, we are transitioning from a single centralized scheduler server that controls all clusters (well, almost all) to a set of servers. Clusters that participate in sharing will remain on the existing server, and all of the dedicated clusters [1] will be distributed across a series of new schedulers. In all instances we will still run the same _version_ of the software; we’ll simply have a fair bit more aggregate horsepower for scheduling. This should provide a number of benefits, primarily in the response you see when submitting new jobs and querying the status of queued or running jobs. Provided this phase goes well, we will look to upgrade the version of the Moab/Torque software in January.
- There are some actions needed from the user community. Users of dedicated clusters will need to resubmit jobs that did not start before the maintenance. The scheduler will not start a job that cannot complete before maintenance begins, so this only affects jobs that were submitted but never started. You are affected if you _do not_ have access to the iw-shared-6 queue. A sketch for identifying such jobs appears after the queue list below.
- New storage platform: We have been enabling access to the DDN storage platform via the GPFS filesystem on all RHEL6 clusters. This is now complete, and we are opening up the DDN for investment. Faculty may purchase drives on this platform to expand project spaces. Please contact me directly if you are interested in a storage purchase. Our maintenance activities will include an update to the GPFS software, which provides finer-grained options for user quotas.
- Filesystem balancing: We will be moving the /nv/pf2 project filesystem for the FoRCE cluster to a different server. This will allow some room for expansion and guard against it filling completely. We expect no user-visible changes here either: all data will be preserved, no paths will change, etc. The data will simply reside on different physical hardware.
- VMware improvements: We will be rebalancing the storage used by our virtual machine infrastructure (i.e. head nodes), along with other related tasks aimed at improving performance and stability for these machines. This is still an active area of preparation, so the full set of fixes and improvements remains to be fully tested.
- Cluster upgrades: We will be upgrading the Atlantis cluster from RHEL5 to RHEL6. Also, we will be upgrading 32 InfiniBand-connected nodes in Atlas-5 to Atlas-6.
[1] Specifically, jobs submitted to the following queues will be affected:
- aryabhata, aryabhata-6
- ase1-6
- athena, athena-6, athena-8core, athena-debug
- atlantis
- atlas-6, atlas-ge, atlas-ib
- complexity
- cssbsg
- epictetus
- granulous
- joe-6, joe-6-vasp, joe-fast, joe-test
- kian
- martini
- microcluster
- monkeys, monkeys_gpu
- mps
- optimus
- rozell
- skadi
- uranus-6
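For those affected, a quick way to see which of your jobs are still waiting (and will therefore need to be resubmitted after the maintenance) is to look for jobs in state "Q" on one of the queues above. The short Python sketch below illustrates the idea; it is an unofficial example rather than a PACE-provided tool, and it assumes the Torque "qstat" client is on your path and that your site uses the usual "qstat -u <username>" column layout, so adjust as needed.

#!/usr/bin/env python
# Unofficial sketch: list this user's jobs still waiting (state "Q") on the
# dedicated queues affected by the maintenance window. Assumes the Torque
# "qstat" client is on PATH; the column positions below follow the usual
# "qstat -u <user>" output and may need adjusting for your site.
import getpass
import subprocess

# Queues from footnote [1] above.
AFFECTED_QUEUES = {
    "aryabhata", "aryabhata-6",
    "ase1-6",
    "athena", "athena-6", "athena-8core", "athena-debug",
    "atlantis",
    "atlas-6", "atlas-ge", "atlas-ib",
    "complexity",
    "cssbsg",
    "epictetus",
    "granulous",
    "joe-6", "joe-6-vasp", "joe-fast", "joe-test",
    "kian",
    "martini",
    "microcluster",
    "monkeys", "monkeys_gpu",
    "mps",
    "optimus",
    "rozell",
    "skadi",
    "uranus-6",
}

def queued_jobs_on_affected_queues(user):
    """Return (job_id, queue) pairs for jobs still waiting in an affected queue."""
    output = subprocess.check_output(["qstat", "-u", user]).decode()
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Typical data rows: job id, username, queue, job name, ..., state, elapsed time
        if len(fields) >= 10 and fields[1] == user:
            job_id, queue, state = fields[0], fields[2], fields[9]
            if state == "Q" and queue in AFFECTED_QUEUES:
                jobs.append((job_id, queue))
    return jobs

if __name__ == "__main__":
    for job_id, queue in queued_jobs_on_affected_queues(getpass.getuser()):
        print("resubmit after maintenance: %s (queue %s)" % (job_id, queue))

Jobs found this way will not carry over across the maintenance, so hold on to their submission scripts and resubmit them once the clusters return to service.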