
Archive for January, 2011

Final stage of the Joe cluster migration

Posted on Friday, 14 January, 2011

In consultation with the ChBE IT Committee, we have adjusted the Joe cluster migration plan to take advantage of our upcoming maintenance window.  On Wednesday (1/19/2011), the “old” RHEL4 nodes in the Joe cluster will be permanently decommissioned, as will the joe.pace.gatech.edu head node.  We will rename the neojoe.pace.gatech.edu head node to joe.pace.gatech.edu, and create a ‘neojoe’ alias that points to it.  Starting Thursday, we will begin installing on the old nodes the same set of software that is on the “new” RHEL5 nodes, and integrating them into the rest of the cluster.  If there is any reason you cannot run on the ‘neojoe’ portion of the cluster, please let us know ASAP.

REMINDER – upcoming quarterly maintenance – 1/19/2011

Posted on Friday, 14 January, 2011

Just a reminder – our quarterly maintenance is coming up on Wednesday, January 19.  Expect all clusters and storage to be offline for the day.  This evening, we will lower the maximum job length to five days and continue to lower this limit daily.

Major items on the list are firmware upgrades for all of the InfiniBand network switches, along with related software, which will give us better stability and better diagnostic visibility into the IB network.  We’ll also be upgrading firmware on the file server network interfaces.

***** IMPORTANT *****

We’ll also be instituting space reclamation on the scratch storage.  Going forward, the scratch will be cleaned daily, and any file older than 60 days will be removed.  An email notification will be sent one week before a file is removed.  Remember, the scratch storage is not intended for long-term storage of data sets.  As such, we do not maintain backups of this storage.

***** IMPORTANT *****
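
To check your own scratch directory against the policy described above, something like the following Python sketch may help.  It is not the script we run.  It assumes a file's age is measured from its modification time, and the function names and the path passed to scan_scratch are placeholders chosen for illustration.

```python
import os
import time

REMOVAL_AGE_DAYS = 60   # from the announcement: files older than 60 days are removed
WARNING_LEAD_DAYS = 7   # from the announcement: notice goes out one week before removal
SECONDS_PER_DAY = 86400


def classify_age(age_days):
    """Return 'remove', 'warn', or 'ok' for a file of the given age in days."""
    if age_days > REMOVAL_AGE_DAYS:
        return "remove"
    if age_days > REMOVAL_AGE_DAYS - WARNING_LEAD_DAYS:
        return "warn"
    return "ok"


def scan_scratch(root, now=None):
    """Walk root and yield (path, age_days, status) for every file that is not 'ok'.

    Assumption: age is taken from the modification time (st_mtime).
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = (now - mtime) / SECONDS_PER_DAY
            status = classify_age(age_days)
            if status != "ok":
                yield path, age_days, status
```

Calling list(scan_scratch("/path/to/your/scratch")) with your own scratch path lists the files that fall inside the warning or removal range under these assumptions.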

Upcoming quarterly maintenance – 1/19/2011

Posted on Tuesday, 4 January, 2011

Just a reminder, folks – our quarterly maintenance is coming soon.  We had previously scheduled it for January 18, but will be delaying until January 19 due to the MLK holiday.  We’re finalizing the list of items we need to address, and will follow up shortly with the technical details.

I’ve lowered the maximum allowed time for jobs to 14 days, and will continue to decrement it as we approach the maintenance window.  This only applies to newly submitted jobs.  Jobs that haven’t completed by the morning of 1/19 will be cancelled.
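
If you want to work out whether a job can finish before it would be cancelled, the arithmetic can be sketched in Python as below.  This is only an illustration: the 06:00 start hour is an assumption (the announcement says only "the morning of 1/19"), the function names are made up for this example, and the calculation assumes the job starts running the moment it is submitted.

```python
from datetime import datetime, timedelta

# Maintenance day is from the announcement; the 06:00 start hour is an assumption.
MAINTENANCE_START = datetime(2011, 1, 19, 6, 0)


def max_safe_walltime(submit_time, maintenance_start=MAINTENANCE_START):
    """Longest run time that still ends before maintenance begins.

    Assumes the job starts immediately at submit_time (no queue wait).
    """
    remaining = maintenance_start - submit_time
    return max(remaining, timedelta(0))


def fits_before_maintenance(submit_time, walltime,
                            maintenance_start=MAINTENANCE_START):
    """True if a job submitted at submit_time with this walltime ends in time."""
    return submit_time + walltime <= maintenance_start
```

Any time spent waiting in the queue reduces the safe run time further, so treat the result as an upper bound.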

Expect all clusters to be offline for the day. This includes:

  • Aryabhata
  • Athena
  • Atlantis
  • Atlas
  • BioCluster
  • FoRCE
  • Joe & NeoJoe
  • PACE Community Cluster
  • Uranus