
October 2013 PACE maintenance complete

This entry was posted by on Thursday, 17 October, 2013 at
Greetings! We have completed our maintenance activities for October.  All clusters are open, and jobs are flowing.  We came across (and dealt with) a few minor glitches, but I’m very happy to say that no major problems were encountered, so we were able to accomplish all of our goals for this maintenance window.

  • All project storage servers have had their operating systems updated.  This should protect against failures during high load.  Between these fixes and the networking fixes below, we believe all of the root causes of the storage problems we’ve been having recently are resolved.
  • All of our redundancy changes and code upgrades to network equipment have been completed.
  • The decentralization of job scheduling services has been completed.  You should see significantly improved responsiveness when submitting jobs or checking on the status of existing jobs.
    • Please note that you will likely need to resubmit jobs that did not have a chance to run before Tuesday.  Contrary to the previously announced and intended design, this affects the shared clusters as well.  We apologize for the inconvenience.
    • Going forward, the scheduler decentralization has a notable side effect.  Previously, any login node could submit jobs to any queue, as long as the user had access to do so.  Now, this may no longer be the case.
    • For instance, a user of the dedicated cluster “Optimus” who also had access to the FoRCE could previously submit jobs to the force queue from the Optimus head node.  Now, that user will no longer be able to do so, as Optimus and FoRCE are scheduled by different servers.
    • We believe that these cases should be quite uncommon.  If you do encounter this situation, you should be able to simply log in to the other head node and submit your jobs from there.  You will have the same home, project and scratch directories from either place.  Please let us know if you have problems.
  • All RHEL6 clusters now have access to our new GPFS filesystem.  Additionally, all of the applications in /usr/local (matlab, abacus, PGI compilers, etc.) have been moved to this storage.  This should provide performance improvements for these applications as well as the Panasas scratch storage, which was the previous location of this software.
  • Many of our virtual machines have been moved to different storage.  This should provide an improvement in the responsiveness of your login nodes.  Please let us know (via pace-support@oit.gatech.edu) if you see undesirable performance from your login nodes.
  • The Atlantis cluster has been upgraded from RHEL5 to RHEL6 (actually, this happened before this week), and 31 Infiniband-connected nodes from the RHEL5 side of the Atlas cluster have been upgraded to RHEL6.  (The 32nd has hardware problems and has been shut down.)
  • The /nv/pf2 project filesystem has been migrated to a server with more breathing room.
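
To make the scheduler side effect described above more concrete, here is a minimal sketch of the new rule: a head node can only submit to queues handled by the same scheduling server.  All of the names in it are hypothetical placeholders chosen for illustration (the scheduler server names, the head node names, and the "optimus" queue name); only the force queue and the Optimus/FoRCE split come from the description above.  It is not how the schedulers are actually configured.

```python
from typing import List

# Hypothetical mapping from queue name to the scheduling server that owns it.
# The server names are placeholders, not real hosts.
QUEUE_SCHEDULER = {
    "optimus": "sched-optimus",
    "force": "sched-force",
}

# Hypothetical mapping from head (login) node to the scheduling server it uses.
HEAD_NODE_SCHEDULER = {
    "optimus-head": "sched-optimus",
    "force-head": "sched-force",
}


def can_submit(head_node: str, queue: str) -> bool:
    """Return True if the head node and the queue share a scheduling server."""
    return HEAD_NODE_SCHEDULER[head_node] == QUEUE_SCHEDULER[queue]


def head_nodes_for(queue: str) -> List[str]:
    """List the head nodes from which jobs can be submitted to the queue."""
    target = QUEUE_SCHEDULER[queue]
    return sorted(
        node for node, sched in HEAD_NODE_SCHEDULER.items() if sched == target
    )


if __name__ == "__main__":
    # Before the change, any login node could reach any queue the user had
    # access to; under the sketch's rule this cross-cluster case is rejected.
    print(can_submit("optimus-head", "force"))
    # The workaround: find a head node served by the same scheduler.
    print(head_nodes_for("force"))
```

In this sketch, a user on the hypothetical `optimus-head` node who wants the force queue would look up `head_nodes_for("force")` and log in there instead; home, project and scratch directories are the same from either place.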

Additionally, we were able to complete a couple of bonus objectives.

  • You’ll notice a new message when logging in to your clusters.  Part of this message is brought to you from our Information Security department, and the rest is intended to give a high-level overview of the specific cluster and the queue commonly associated with it.
  • Infiniband network redundancy is now in place for the DDN/GPFS storage.
  • The /nv/pase1 filesystem was moved off of temporary storage, and onto the server purchased for the Ase1 cluster.
