
Archive for April, 2011

upcoming quarterly maintenance – 4/19/2011

Posted on Monday, 4 April, 2011

Just a reminder folks – our quarterly maintenance is coming soon.

I’ve lowered the maximum allowed run time for jobs to 14 days, and will continue to lower it as we approach the maintenance window. This limit applies only to newly submitted jobs. Jobs that haven’t completed by the morning of 4/19 will be cancelled.
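As a rough illustration of the cutoff described above, the following sketch checks whether a job would end before the maintenance window. The 06:00 cutoff time and both function names are assumptions made for this example; the announcement only says "the morning of 4/19".

```python
from datetime import datetime, timedelta

# Assumed cutoff: the announcement only says "the morning of 4/19";
# 06:00 is a placeholder, not an official time.
MAINTENANCE_START = datetime(2011, 4, 19, 6, 0)


def finishes_before_maintenance(start, walltime, cutoff=MAINTENANCE_START):
    """Return True if a job starting at `start` and running for `walltime` ends by `cutoff`."""
    return start + walltime <= cutoff


def max_safe_walltime(start, cutoff=MAINTENANCE_START):
    """Longest run time that still ends by `cutoff` (zero if the cutoff has already passed)."""
    return max(cutoff - start, timedelta(0))


if __name__ == "__main__":
    start = datetime(2011, 4, 4, 9, 0)
    print(finishes_before_maintenance(start, timedelta(days=14)))
    print(max_safe_walltime(start))
```

Note that a job may wait in the queue before it starts, so the start time used here is not necessarily the submission time.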

Major items on the list are the application of standard OS updates from RedHat, the updating and normalization of firmware in InfiniBand host adapters, and a reboot of the main InfiniBand switch to complete the transition to a software-based subnet manager.

***** IMPORTANT *****

Please use the TestFlight cluster (announced at http://blog.pace.gatech.edu/?p=3151) to test your codes. The updates have already been applied there, and we would appreciate any feedback you have, positive or negative.

***** IMPORTANT *****

Expect all clusters to be offline for the day. This includes:

  • Aryabhata
  • Athena
  • Atlantis
  • Atlas (RedHat updates will not be applied)
  • BioCluster
  • Cygnus
  • FoRCE
  • Joe
  • TestFlight Cluster (RedHat updates already applied)
  • Legacy PACE Community Cluster (RedHat updates will not be applied)
  • Uranus

Detailed per-user usage metrics now available

Posted on Monday, 4 April, 2011

We are pleased to announce the immediate availability of per-user, per-group, and per-cluster usage metrics for current PACE clusters. The metrics software reports resource utilization by period and cluster through the following menu options:

  • Dashboard (pie chart overviews)
  • Wait Time
  • CPU Consumption
  • User Detail
  • Group Detail
  • Queue Detail

These metrics can be accessed at http://metrics.pace.gatech.edu (available from on-campus networks only; authentication required). Please note that the reported results may deviate slightly from actual usage.

TestFlight Cluster available!

Posted on Monday, 4 April, 2011

We now have a small cluster available for use by all PACE participants. The TestFlight cluster is a small testing resource intended to provide an insulated environment for debugging and validating both user code and system updates.

Currently, it is running a slightly updated software stack relative to our other clusters. Its main features are standard updates from RedHat and a newer version of the Panasas client we use to access the high-performance scratch storage. We are considering this software stack for roll-out at our next maintenance window on April 19, and ask that users test their current codes between now and then. Users may test the new software by submitting their jobs to a new queue named “testflight”.

Given the limited resource availability, our initial configuration for this queue limits each user to two (2) actively running jobs, each of which may run for no longer than 6 hours. Additional jobs may be queued into the TestFlight cluster, but at most two will run at a time for any particular user. The scheduler will run jobs for as many different users as the queued jobs and available resources allow. The details of this queue, as well as our other queues, are published in our user guide, at http://share-it.gatech.edu/oit/pace/user-guide/the-job-scheduler/queue-names.
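The per-user limit described above can be pictured with a small sketch. This is not the scheduler's actual logic: the function name, the data shapes, and the single `free_slots` count are simplifying assumptions for illustration, and a real scheduler also weighs priorities and per-job resource requests.

```python
from collections import Counter

MAX_RUNNING_PER_USER = 2  # from the announcement: two actively running jobs per user


def select_jobs_to_start(queued, running, free_slots, per_user_limit=MAX_RUNNING_PER_USER):
    """Pick queued jobs to start, in queue order, honoring the per-user cap.

    queued:     list of (job_id, user) tuples in submission order
    running:    list of users, one entry per job already running
    free_slots: how many more jobs can start right now (a simplification)
    """
    active = Counter(running)
    started = []
    for job_id, user in queued:
        if len(started) >= free_slots:
            break
        if active[user] < per_user_limit:
            started.append(job_id)
            active[user] += 1
    return started


if __name__ == "__main__":
    queue = [("a1", "alice"), ("a2", "alice"), ("a3", "alice"), ("b1", "bob")]
    print(select_jobs_to_start(queue, running=[], free_slots=3))
```

In this example a third job from the same user stays queued while a job from a different user is allowed to start, which matches the behavior the announcement describes.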

Equipment details of the TestFlight cluster may be found on our web site, at http://pace.gatech.edu/testflight-cluster.

Unless we find significant issues, these updates will be deployed on the clusters listed below.  Please note that the Atlas cluster is excluded from this list.

  • aryabhata
  • athena
  • atlantis
  • biocluster
  • cygnus
  • force
  • joe
  • uranus