
Archive for category News

Survey results are in!

Posted by on Tuesday, 10 May, 2011

In the past year, PACE conducted a survey of current participants and non-participants of the High Performance Computing (HPC) clusters maintained by the PACE team at Georgia Tech.  Feedback from 156 respondents gave good insight into satisfaction levels and where PACE can improve as this partnership grows.

We also recognize that more than 40% of respondents are not participating in PACE due to a lack of awareness or familiarity.  We will be working hard to bring more awareness to campus.

Details of the results can be found at http://pace.gatech.edu/sites/default/files/PACESurvey2011.pdf

A special thanks to Kathi Wallace, Elizabeth Campell, and Lew Lefton for making the survey possible. Please email neil.bright@oit.gatech.edu for more information.

 

upcoming quarterly maintenance – 4/19/2011

Posted by on Monday, 4 April, 2011

Just a reminder folks – our quarterly maintenance is coming soon.

I’ve lowered the maximum allowed time for jobs to 14 days, and will continue to lower it as we approach the maintenance window.  This only applies to newly submitted jobs.  Jobs that haven’t completed by the morning of 4/19 will be cancelled.

Major items on the list are the application of standard OS updates from RedHat, updating and normalizing the firmware in the Infiniband host adapters, and rebooting the main Infiniband switch to complete the transition to a software-based subnet manager.

***** IMPORTANT *****

Please use the TestFlight cluster (announced here http://blog.pace.gatech.edu/?p=3151) to test your codes.  The updates have already been applied there, and we would appreciate any feedback (positive or negative) you have.

***** IMPORTANT *****

Expect all clusters to be offline for the day. This includes:

  • Aryabhata
  • Athena
  • Atlantis
  • Atlas (RedHat updates will not be applied)
  • BioCluster
  • Cygnus
  • FoRCE
  • Joe
  • TestFlight Cluster (RedHat updates already applied)
  • Legacy PACE Community Cluster (RedHat updates will not be applied)
  • Uranus

Detailed per-user usage metrics now available

Posted by on Monday, 4 April, 2011

We are pleased to announce the immediate availability of per-user/group/cluster usage metrics for current PACE clusters.  This software provides resource utilization metrics by period and cluster for the following menu options:

  • Dashboard (pie chart overviews)
  • Wait Time
  • CPU Consumption
  • User Detail
  • Group Detail
  • Queue Detail

These metrics can be accessed at http://metrics.pace.gatech.edu (available to on-campus networks only; authentication required).  Please note that there can be a small deviation between the results presented and actual usage.

TestFlight Cluster available!

Posted by on Monday, 4 April, 2011

We now have a small cluster available for use by all PACE participants.  The TestFlight cluster is a testing resource intended to provide an insulated environment for debugging and validating both user code and system updates.

Currently, it is running a slightly updated software stack relative to our other clusters; its main features are standard updates from RedHat as well as a newer version of the Panasas client we use to access the high-performance scratch storage.  We are considering this software stack for roll-out at our next maintenance window on April 19, and ask that users test their current codes between now and then.  Users may test the new software by submitting their jobs to a new queue named “testflight”.

Given the limited resources available, our initial configuration for this queue limits each user to two (2) actively running jobs, each of which may run for no longer than 6 hours.  Additional jobs may be queued on the TestFlight cluster, but at most two will run at a time for any particular user.  The scheduler will run jobs for as many different users as queued jobs and available resources allow.  The details of this queue, as well as our other queues, are published in our user guide at http://share-it.gatech.edu/oit/pace/user-guide/the-job-scheduler/queue-names.
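For reference, a minimal job script for the new queue might look something like the sketch below.  This assumes a PBS/Torque-style scheduler to match the usual qsub workflow; the job name, resource request, and executable are placeholders, not part of this announcement.

    #!/bin/bash
    #PBS -N testflight_check        # illustrative job name
    #PBS -q testflight              # the new TestFlight queue
    #PBS -l nodes=1:ppn=2           # modest request for a test run
    #PBS -l walltime=06:00:00       # stay within the 6-hour per-job limit

    cd $PBS_O_WORKDIR               # run from the submission directory
    ./my_code                       # replace with your own executable

Aside from the queue name and the 6-hour walltime cap, nothing else about your normal job scripts should need to change.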

Equipment details of the TestFlight cluster may be found on our web site, at http://pace.gatech.edu/testflight-cluster.

Unless we find significant issues, these updates will be deployed on the clusters listed below.  Please note that the Atlas cluster is excluded from this list.

  • aryabhata
  • athena
  • atlantis
  • biocluster
  • cygnus
  • force
  • joe
  • uranus

REMINDER: upcoming changes to Fluent 6.3.x / Gambit 2.4.x support

Posted by on Wednesday, 23 February, 2011

Hi folks, just a reminder that Fluent 6.x will stop working on Saturday…

We do have Ansys 12 and 13 installed in the usual /usr/local/packages locations for you to transition to.
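If you would like to confirm which versions are installed before updating your job scripts, a quick check along these lines should work (a sketch; the exact directory names under /usr/local/packages will vary):

    # list the installed Ansys trees under the usual packages location
    ls /usr/local/packages | grep -i ansys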

network maintenance Tuesday 3/1 early AM

Posted by on Wednesday, 23 February, 2011

The backbone team would like to perform some routine maintenance on the firewall that protects the external PACE networks between 6:00am and 7:00am on Tuesday March 1.  This will result in an outage of about 10 minutes during that hour, and will drop connections into the clusters.  It should not affect running jobs.  As always, please let me know if this is going to cause major problems.

Thanks!

Status of the Joe cluster

Posted by on Monday, 21 February, 2011

As many of you have noticed already, we have finished integrating the two halves of Joe.  All nodes now have the same set of software, based on RHEL5, and receive their work from the same scheduler.  The “joe.pace.gatech.edu” and “neojoe.pace.gatech.edu” names now refer to the same machine; they are no longer different systems.

Please see our user guide page [1] on queue names for the full list of queues available on the Joe cluster.  Generally speaking, you probably want to use the queue named “joe”, along with as accurate a description as possible of your wall clock time, processor, and memory requirements.  The scheduler will do the hard work of placing your job on the right resources so it completes as soon as possible.
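As an illustration of spelling those requirements out (a sketch only; the resource values are placeholders to replace with your job’s actual needs, assuming the same PBS/Torque-style qsub interface used elsewhere on PACE):

    # hypothetical submission to the "joe" queue with explicit resource requests
    qsub -q joe -l walltime=24:00:00,nodes=1:ppn=8,mem=16gb my_job.pbs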

Thanks!

[1] http://share-it.gatech.edu/oit/pace/user-guide/the-job-scheduler/queue-names

Final stage of the Joe cluster migration

Posted by on Friday, 14 January, 2011

In consultation with the ChBE IT Committee, we have adjusted the Joe cluster migration plan to leverage our upcoming maintenance window.  On Wednesday (1/19/2011), the “old” RHEL4 nodes in the Joe cluster will be permanently decommissioned, as will the joe.pace.gatech.edu head node.  We will rename the neojoe.pace.gatech.edu head node to joe.pace.gatech.edu, and create a ‘neojoe’ alias that points to it.  Starting Thursday, we will begin the process of installing the same set of software as is on the “new” RHEL5 nodes and integrating the old nodes into the rest of the cluster.  If there is any reason you cannot run on the ‘neojoe’ portion of the cluster, please let us know ASAP.

REMINDER – upcoming quarterly maintenance – 1/19/2011

Posted by on Friday, 14 January, 2011

Just a reminder – our quarterly maintenance is coming up on Wednesday, January 19.  Expect all clusters and storage to be offline for the day.  This evening, we will lower the maximum job length to five days and continue to lower this limit daily.

Major items on the list are firmware upgrades to all of the Infiniband network switches and related software updates, which should give us better stability and better diagnostic visibility into the IB network.  We’ll also be upgrading firmware on the file server network interfaces.

***** IMPORTANT *****

We’ll also be instituting space reclamation on the scratch storage.  Going forward, the scratch space will be cleaned daily, and any file older than 60 days will be removed.  An email notification will be sent one week prior to a file’s removal.  Remember, the scratch storage is not intended for long-term storage of data sets.  As such, we do not maintain backups of this storage.
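If you would like to see ahead of time which of your files are nearing the cutoff, a check along these lines should work (a sketch; substitute the actual path of your scratch directory):

    # list files last modified more than 53 days ago, i.e. within a week
    # of the 60-day removal threshold
    find /path/to/your/scratch -type f -mtime +53 -ls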

***** IMPORTANT *****

upcoming quarterly maintenance – 1/19/2011

Posted by on Tuesday, 4 January, 2011

Just a reminder folks – our quarterly maintenance is coming soon.  We had previously scheduled it for January 18, but will be delaying until January 19 due to the MLK holiday.  We’re finalizing the list of items we need to address, and will follow up shortly with the technical details.

I’ve lowered the maximum allowed time for jobs to 14 days, and will continue to lower it as we approach the maintenance window.  This only applies to newly submitted jobs.  Jobs that haven’t completed by the morning of 1/19 will be cancelled.

Expect all clusters to be offline for the day. This includes:

  • Aryabhata
  • Athena
  • Atlantis
  • Atlas
  • BioCluster
  • FoRCE
  • Joe & NeoJoe
  • PACE Community Cluster
  • Uranus