The major activity for maintenance day is the RedHat 6.1 to RedHat 6.2 software update. (Please test your codes!!) This will affect a significant portion of our user base. We’re also instituting soft quotas on the scratch space. Please see the details below.
The following clusters are running RedHat 5 and are NOT affected:
The following have already been upgraded to the new RedHat 6.2 stack. We would appreciate reports of any problems you encounter:
- Monkeys
- MPS
- Isabella
- Joe-6
- Aryabhata-6
If your cluster is not mentioned above, you are affected by this software update. Please test using the ‘testflight’ queue; jobs are limited to 48 hours in this queue. If you would like to recompile your software with the 6.2 stack, please log in to the ‘testflight-6.pace.gatech.edu’ head node.
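As a minimal sketch of a test submission, assuming a Torque/Moab-style scheduler (the job name, core count, and application are hypothetical placeholders; the queue name and 48-hour limit are from this announcement):

```
#!/bin/bash
#PBS -N testflight-check     # hypothetical job name
#PBS -q testflight           # the test queue for the new 6.2 stack
#PBS -l nodes=1:ppn=4        # adjust to what your code needs
#PBS -l walltime=48:00:00    # the queue's 48-hour maximum
#PBS -j oe                   # merge stdout and stderr into one file

cd $PBS_O_WORKDIR            # run from the submission directory
./my_application input.dat   # hypothetical application and input
```

If you need to rebuild against the 6.2 stack first, log in to the testflight-6.pace.gatech.edu head node mentioned above, recompile there, and then submit the script with qsub.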
Other activities we have planned are:
Relocating some project directory servers to an alternate data center on campus. We have strong network connectivity, so this should not change the performance of these filesystems. No user modifications are needed. Affected filesystems are:
- /nv/hp3 – Joe
- /nv/pb1 – BioCluster
- /nv/pb3 – Apurimac
- /nv/pc1 – Cygnus
- /nv/pc2 – Cygnus
- /nv/pc3 – Cygnus
- /nv/pec1 – ECE
- /nv/pj1 – Joe
- /nv/pma1 – Math
- /nv/pme1 – Prometheus
- /nv/pme2 – Prometheus
- /nv/pme3 – Prometheus
- /nv/pme4 – Prometheus
- /nv/pme5 – Prometheus
- /nv/pme6 – Prometheus
- /nv/pme7 – Prometheus
- /nv/pme8 – Prometheus
- /nv/ps1 – Critcel
- /nv/pz1 – Athena
Activities on the scratch space – no user changes are expected for any of these:
- Balance some users across volumes v3, v4, v13, and v14. This will involve moving users from one volume to another, but we will place links in the old locations.
- Run a filesystem consistency check on the v14 volume. This has the potential to take a significant amount of time. Please watch the pace-availability email list (or this blog) for updates if it takes longer than expected.
- Apply firmware updates to the scratch servers to resolve some crash & failover events that we’ve been seeing.
- Institute soft quotas. Users exceeding 10TB of usage on the scratch space will receive automated warning emails, but writes will be allowed to proceed. Currently, this will affect 6 of 750+ users. A 10TB share represents about 5% of a rather expensive shared 215TB resource, so please be cognizant of the impact on other users. A quick way to check your own usage is sketched below.
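As a quick, hedged example of checking where you stand against the 10TB soft quota (the ~/scratch path is an assumption; substitute wherever your scratch directory is mounted):

```
# Total usage of your scratch directory (path is an assumption).
du -sh ~/scratch

# Largest top-level directories first, to see what is worth cleaning up.
du -h --max-depth=1 ~/scratch | sort -hr | head
```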
Retirement of old filesystems. User data will be moved to alternate filesystems. Affected filesystems are:
Performance upgrades (hardware RAID) for NFSroot servers for the Athena cluster. Previous maintenance activities have upgraded other clusters already.
Moving some filesystems off of temporary homes and onto new servers. Affected filesystems are:
- /nv/pz2 – Athena
- /nv/pb2 – Optimus
If time permits, we have a number of other “targets of opportunity”:
- relocate some compute nodes and servers, removing retired systems
- rework a couple of Infiniband uplinks for the Uranus cluster
- add resource tags to the scheduler so that users can better select compute node features/capabilities from their job scripts (see the sketch after this list)
- relocate a DNS/DHCP server for geographic redundancy
- fix system serial numbers in the BIOS for asset tracking
- test a new Infiniband subnet manager to gather data for future maintenance day activities
- rename some ‘twin nodes’ for naming consistency
- apply BIOS updates to some compute nodes in the Optimus cluster to facilitate remote management
- test an experimental software component of the scratch system. Panasas engineers will be onsite to do this and will revert the change before the system returns to production. This will help gather data and validate a fix for some other issues we’ve been seeing.
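For the resource-tags item above, a hedged sketch of what feature selection could look like in a Torque-style job script once tags are in place (the ‘rhel62’ and ‘gpu’ tag names are hypothetical placeholders, not announced features):

```
# Request a node advertising both hypothetical features.
#PBS -l nodes=1:ppn=8:rhel62:gpu
```

The same properties can be given inline at submission time, e.g. qsub -l nodes=1:ppn=8:rhel62:gpu myjob.pbs.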