…back to regularly scheduled events.
Our next maintenance window is fast approaching. We will continue the 2-day downtimes, with the next one occurring on Tuesday, January 14 and Wednesday, January 15. The list of major changes is small this time around, but impactful.
The largest change, affecting all clusters, is a major update to the Moab & Torque scheduling system used to schedule and manage your jobs. The upgraded versions fix a number of long-standing stability problems and scaling issues, including command timeouts and the handling of large job sets.
The testflight cluster has already been updated and is available to anyone who wishes to test their submission processes against the new versions. In many cases, the processes you use to submit and query your jobs will remain the same. For some users, a change in the way you use the system may be required: you will still be able to accomplish the same things, but may need to use different commands to do so.
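If you would like to try the new versions, a minimal Torque test job for the testflight cluster might look something like the sketch below. The job name, resource requests, and script filename are placeholders rather than anything specific to our configuration; adjust them to match your own workflow.

    #!/bin/bash
    #PBS -N upgrade-test             # placeholder job name
    #PBS -l nodes=1:ppn=1            # single core on a single node
    #PBS -l walltime=00:05:00        # short walltime for a quick test
    #PBS -j oe                       # merge stdout and stderr into one file

    cd $PBS_O_WORKDIR                # run from the directory the job was submitted from
    echo "Test job running on $(hostname)"

Submitting it with 'qsub test.pbs' and then checking it with 'qstat' (or Moab's 'checkjob <jobid>') is a quick way to confirm that your usual submit-and-query workflow still behaves as expected under the new versions.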
We have updated our usage documentation to include a simple transition guide here.
In addition to the guide, we have also written a FAQ, which can be viewed by running the command 'jan2014-faq' after logging in.
Because of the version differences between the old and new software, we will unfortunately not be able to preserve any jobs that are still in a queued state when maintenance begins. If you have queued jobs going into the maintenance window, you will need to resubmit them afterward.
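If it would help to have a record of what needs resubmitting, one simple approach is to save a listing of your own jobs before the window opens; the filename below is only an example.

    # save a snapshot of your current jobs (queued and running) for later reference
    qstat -u $USER > jobs-before-maintenance.txt

After maintenance, that snapshot can serve as a checklist while you qsub your job scripts again.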
The fixes planned for January also include the following:
Infrastructure:
- Operating System upgrades to the server running scheduling software for the “shared” clusters. This will bring it up to the same level as the other scheduler servers.
- Adjustments to scalability & performance parameters on our GPFS filesystem.
Optimus cluster:
- Optimus users will have access to a new queue, 'optimus-force-6', as well as access to the iw-shared-6 queue (see the example at the end of this list for directing jobs to a specific queue).
Gryphon cluster:
- The current (temporary) head node and scheduler server will return to their roles as compute nodes for the cluster.
- New servers will be brought into production for the head node & scheduler servers.
- Data migrations between the pb1, pb4 and DDN filesystems. These should be transparent to users and should ease the space crunch everybody has been experiencing.
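For those wondering how to direct jobs to the new 'optimus-force-6' queue mentioned above, the standard Torque mechanism is the -q option, given either on the qsub command line or as a directive in the job script; the script name here is only an example.

    # on the command line
    qsub -q optimus-force-6 myjob.pbs

    # or inside the job script itself
    #PBS -q optimus-force-6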