PACE: A Partnership for an Advanced Computing Environment

January 9, 2015

PACE quarterly maintenance – January ’15

Filed under: tech support — admin @ 10:39 pm

Hi everybody,

Our January maintenance window is upon us.  We’ll have PACE clusters down Tuesday and Wednesday next week, January 13 and 14.  We’ll get started at 6:00am on Tuesday, and have things back to you as soon as possible.

Major items this time around include:

  • routine patching on our DDN system that serves project directories for many users
  • routine patching on the file server that provides the storage for /usr/local and our virtual machine infrastructure (including most head nodes)
  • firmware updates on some NFS project directory servers to address stability issues

Additionally, the Joe and Atlas cluster users have graciously offered to test out an upgraded version of the Moab/Torque scheduler software.  If the upgrade goes well on these two clusters, we will look to roll it out to the rest of the PACE universe during our April maintenance period.  If you use clusters other than Atlas and Joe, the rest of this announcement will not affect you next week. Users of Atlas and Joe can expect the following:

  • The current and upgraded versions use different databases, so we will not be able to migrate submitted jobs.  The scheduler will start with an empty queue, and you will need to resubmit your jobs after the maintenance window (sorry for this inconvenience).
  • We will start using “node packing”, which places as many jobs on a node as possible before moving on to the next one. With the current version, users can submit many single-core jobs that each land on a separate node, making it more difficult for the scheduler to start jobs that require entire nodes.
  • You will be able to use msub for interactive jobs (this is currently broken due to a bug), although the vendor recommends using “qsub” for everything (we confirmed that it’s much faster than msub).
  • There will no longer be a discrepancy between job IDs generated by msub (Moab.###) and qsub (####). You will always see a single job ID (in plain number format) regardless of your msub/qsub preference.
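
To illustrate why node packing helps jobs that need entire nodes, here is a minimal sketch in Python. It is a toy model, not Moab's actual allocation algorithm: the function names, the 4-node/8-core cluster, and the tie-breaking rules are all assumptions made for the example.

```python
def place_jobs(num_nodes, cores_per_node, job_cores, pack):
    """Return the free cores left on each node after placing jobs in order.

    Toy model only (hypothetical, not the scheduler's real logic).
    """
    free = [cores_per_node] * num_nodes
    for need in job_cores:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            continue  # the job would wait in the queue; ignored in this sketch
        if pack:
            # Packing: choose the fullest node that still fits the job.
            target = min(candidates, key=lambda i: (free[i], i))
        else:
            # Spreading: choose the emptiest node.
            target = max(candidates, key=lambda i: (free[i], -i))
        free[target] -= need
    return free


def whole_nodes_free(free, cores_per_node):
    """Count nodes with no jobs on them at all."""
    return sum(1 for f in free if f == cores_per_node)


single_core_jobs = [1] * 4
packed = place_jobs(4, 8, single_core_jobs, pack=True)
spread = place_jobs(4, 8, single_core_jobs, pack=False)

print("packed:", packed, "whole nodes free:", whole_nodes_free(packed, 8))
print("spread:", spread, "whole nodes free:", whole_nodes_free(spread, 8))
```

In this toy example, four single-core jobs leave three nodes completely empty when packed, but none when spread out, so a job asking for a whole node could start right away only in the packed case.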

Other improvements included in the scheduler upgrade:

  • Speed – the new versions of Moab and Torque are multithreaded, making it possible for some query commands (e.g. showq) to return instantly regardless of the load on the scheduler. Currently, these commands usually time out when a user submits a large job array.
  • Introduction of cpusets. When a user is given X cores, they will not be able to use more than that. Currently, users can easily exceed the requested limits by spawning more processes and threads, and Torque cannot do much to stop that. Cpusets will significantly reduce job interference and allow us to finally use “node packing” as explained above.
  • Several other bug fixes and improvements, addressing issues such as zombie processes, lost output files, missing array jobs, and long job allocation times.
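
Once cpusets are in place, code that sizes its thread or process pool from the total core count of the node will oversubscribe the cores it was actually given. As a minimal sketch, assuming a Linux compute node and a job written in Python, the following checks how many cores the current process is allowed to run on. The helper name allowed_cores is hypothetical; os.sched_getaffinity is a standard library call available on Linux.

```python
import os


def allowed_cores():
    """Number of CPU cores this process is permitted to run on."""
    get_affinity = getattr(os, "sched_getaffinity", None)
    if get_affinity is not None:
        # On Linux, the affinity mask is limited by any cpuset applied to the job.
        return len(get_affinity(0))
    # Fallback where sched_getaffinity is unavailable: total core count.
    return os.cpu_count() or 1


workers = allowed_cores()
print(f"sizing worker pool to {workers} core(s)")
```

Sizing the pool from this value rather than from os.cpu_count() keeps the number of workers in line with the cores requested for the job.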

We hope these improvements will provide you with a more efficient and productive computing environment. Please let us know (pace-support@oit.gatech.edu) if you have any concerns or questions regarding this maintenance period.
