PACE A Partnership for an Advanced Computing Environment

January 28, 2016

PACE clusters ready for research

Filed under: tech support — admin @ 4:59 am

Our January maintenance window is now complete.  As usual, a number of compute nodes still need to be brought back online; however, we are substantially back online and processing jobs at this point.

Transition to new scratch storage
Of approximately 1,700 PACE users, we were unable to migrate fewer than 35.  All users should have received an email regarding their status.  Additionally, those users who were not migrated will have support tickets created on their behalf so we can track their migrations through to completion.  We expect about 25 of those 35 users to complete within the next 72 hours.  The remaining 10 have data in excess of the allowable quota and will be handled on a case-by-case basis.

Scheduler update
The new schedulers are in place and processing jobs.

Server networking
Task is complete as described.

GPFS tuning
Task is complete as described.

Filesystem migration – /nv/pk1
Task is complete as described.

Read-Only /usr/local
Task is complete as described.

Diskless node transition
We upgraded approximately 65 diskless nodes to use local operating system storage.

January 26, 2016

UNDERWAY: PACE quarterly maintenance – January ’16

Filed under: tech support — admin @ 11:24 am

Our maintenance activities are now underway.  All PACE clusters are down.  Please watch this space for updates.

 

For details on the work to be completed, please see our previous posts.

January 15, 2016

PACE quarterly maintenance – January ’16

Filed under: tech support — admin @ 11:25 pm

Greetings!

The PACE team is once again preparing for maintenance activities that will occur starting at 6:00am Tuesday, January 26 and continuing through Wednesday, January 27.  We have a couple of major items that we hope will provide a much better PACE experience.

Transition to new scratch storage

Building on the new DDN hardware deployed in October, this item is the dominant activity in this maintenance period.  Our old Panasas scratch storage has now exceeded its warranty, so this is a “must do” activity.  Given the performance level of the Panasas and the volume of data it contains, we do not believe we will be able to migrate all data during this maintenance period.  So, we will optimize the migration to maximize the number of users migrated.  Using this approach, we believe we will be able to migrate more than 99% of the PACE user community.  After the maintenance window, we will work directly with those individuals whom we are not able to migrate.  You will receive an email when your migration begins, and when it is complete.  (Delivered to your official GT email address, see previous post!)

After the maintenance period, the old Panasas scratch will still be available, but in read-only mode.  All users will have scratch space provisioned on the new GPFS scratch.  Files for users who are successfully migrated will not be deleted, but will be rendered inaccessible except to PACE staff.  This provides a safety net in the unlikely event that something goes wrong.

For the time being, we will preserve the 5TB soft quota and 7TB hard quota on the new GPFS scratch, as well as the 60-day maximum file age.  However, the timestamps of the files will be reset as they migrate, so the 60-day timer restarts for all files.
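
Because the timer restarts, it may help to know which scratch files are approaching the 60-day limit as time goes on.  The following is a minimal sketch, offered only as an illustration rather than a PACE-provided tool; the ~/scratch starting point and the 60-day cutoff are assumptions you can adjust:

#!/usr/bin/env python
# Illustrative sketch only (not a PACE-provided tool): list scratch files whose
# last modification time is older than CUTOFF_DAYS. Assumes ~/scratch points at
# your scratch directory; adjust the cutoff and path as needed.
import os
import time

CUTOFF_DAYS = 60
cutoff = time.time() - CUTOFF_DAYS * 24 * 3600
scratch = os.path.realpath(os.path.expanduser("~/scratch"))

for dirpath, dirnames, filenames in os.walk(scratch):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.path.getmtime(path) < cutoff:
                print(path)
        except OSError:
            pass  # a file may disappear between listing and stat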

The ~/scratch symlinks in your home directories will also be updated to point to the new locations, so please continue to use these paths to refer to your scratch files.  File names beginning with /panfs will no longer work once your migration is complete.
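
If any of your job scripts hard-code /panfs paths, now is a good time to find and update them.  Here is another minimal sketch, again only an illustration rather than a PACE tool; the script suffixes it searches and the use of your home directory as the starting point are assumptions, so adjust them to match your own layout:

#!/usr/bin/env python
# Illustrative sketch only: show where ~/scratch now resolves and flag scripts
# under the home directory that still reference the old /panfs paths.
import os

print("~/scratch -> %s" % os.path.realpath(os.path.expanduser("~/scratch")))

home = os.path.expanduser("~")
for dirpath, dirnames, filenames in os.walk(home):
    for name in filenames:
        if not name.endswith((".pbs", ".sh")):  # assumed job script suffixes
            continue
        path = os.path.join(dirpath, name)
        try:
            with open(path) as handle:
                for lineno, line in enumerate(handle, 1):
                    if "/panfs" in line:
                        print("%s:%d: %s" % (path, lineno, line.strip()))
        except (IOError, OSError):
            pass  # skip unreadable files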

Scheduler update

Pending successful testing, we will also roll out a bug fix update to our Moab & Torque scheduler software and increase network connectivity for our most heavily used schedulers.  This release addresses erratic notifications about failures when canceling jobs and incorrect groups being included in reports, and it also includes some performance improvements.  Unlike previous scheduler upgrades, all previously submitted jobs will be retained.  No user action should be required as a result of this upgrade.

Server networking

We will be upgrading network connectivity on some of our servers to take advantage of network equipment upgraded in October.  No user action required.

GPFS tuning

We will adjust parameters on some GPFS clients to more appropriately utilize their Infiniband connections.  This only affects the old (6+ years) nodes with DDR connections.  We will also substitute NFS access for native GPFS access on machines that lack Infiniband connectivity or have otherwise been identified as poorly performing GPFS clients.  In particular, this will affect most login nodes.  The /gpfs path names on these machines will be preserved, so no user action is needed here either.
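
If you are curious whether a given machine is using native GPFS or the NFS substitute, the mount table will tell you.  This is a minimal sketch, assuming a Linux /proc/mounts and that the paths of interest begin with /gpfs:

#!/usr/bin/env python
# Illustrative sketch only: report the filesystem type backing /gpfs paths on
# this machine ("gpfs" for native access, "nfs"/"nfs4" for the NFS substitute).
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype = line.split()[:3]
        if mountpoint.startswith("/gpfs"):
            print("%s is mounted as %s (from %s)" % (mountpoint, fstype, device))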

Filesystem migration – /nv/pk1

The /nv/pk1 filesystem for the Aryabhata cluster will be migrated to GPFS.

Read-Only /usr/local

The /usr/local filesystem will be exported read-only.  This is a security measure, and should not impact normal operations.

Diskless node transition

We will continue the transition away from diskless nodes that we started in October.  This mainly affects nodes in the 5-6 year old range.  Apart from more predictable performance on these nodes, this should be a transparent change.

Changing the way PACE handles email

Filed under: tech support — admin @ 11:21 pm

Greetings!

In order to help ensure the reliability of email communications from PACE, we will be changing how we deliver mail effective Wednesday, January 20 (next week!). From this time forward, PACE will use only the officially published email addresses as defined in the Georgia Tech Directory.

This is a needed change, as we have many, many messages that we have been unable to deliver due to outdated or incorrect destinations.

The easiest way to determine your official email address is to visit http://directory.gatech.edu and enter your name. If you wish to change your official email address, visit http://passport.gatech.edu.

In particular, this change will affect the address that is subscribed to PACE-related email lists (e.g., pace-availability) as well as job status emails generated automatically by the schedulers.

For the technically savvy: we will be changing our mail servers to look up addresses from GTED. We will no longer use the contents of a user's ~/.forward file.

P.S. Users of the Tardis cluster do not have entries in the Georgia Tech directory, so this change does not apply to you.

January 7, 2016

Early Testflight Scheduler Upgrade

Filed under: Uncategorized — Semir Sarajlic @ 10:03 pm

As you may know, we are preparing to upgrade the scheduler software to new versions (which are known to be faster and less buggy) on the next maintenance day (Tuesday, 01/26/2016).

The “testflight-sched” scheduler, which runs the “testflight” and “ligo-6” queues, will receive these updates earlier for testing, most likely today.  The upgrades will be mostly transparent to users, with the exception of an estimated 30 minutes of downtime on the scheduler server as well as the “testflight-6” and “ligo-6” headnodes.  For the duration of the scheduler upgrade, your queries and commands will return “Cannot reach server”.  The headnodes will also need to be rebooted several times, so please make sure you don’t use them for anything critical (text editing, interactive MATLAB sessions, etc.).  We have confirmed that old client services on the compute nodes can still communicate with the new server, so we will be able to upgrade nodes one by one as they become idle, without killing any running jobs.

Once the upgrades are complete (we will let you know), we strongly encourage every PACE user to run at least a few test jobs on testflight to make sure everything will work after the upgrades.  We cannot overstate the importance of testing the new version, given our past experience with scheduler upgrades.  Please contact us as soon as possible if you notice any problems or odd behavior.
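
As one illustration only (not an official PACE example), the sketch below submits a trivial job to the testflight queue by piping a small PBS script to qsub on standard input; the job name, resource requests, and five-minute walltime are assumptions, so substitute whatever matches your own workflow:

#!/usr/bin/env python
# Illustrative sketch only: submit a tiny sanity-check job to the testflight
# queue and print the job ID that qsub returns. Assumes qsub is on your PATH.
import subprocess

job_script = """#PBS -N scheduler-sanity-check
#PBS -q testflight
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
echo "Running on $(hostname) at $(date)"
"""

# qsub reads the job script from standard input when no filename is given.
proc = subprocess.Popen(["qsub"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(job_script.encode())
print("Submitted job: %s" % out.decode().strip())

After submission, qstat on the testflight scheduler should show the job, and its output file should appear in the submission directory once it completes.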

Another reason for this early upgrade is to finalize our upgrade procedures, which means they have not yet been fully tested.  Therefore, expect problems and do not rely on testflight for anything critical (a warning that applies to testflight at all times, as the name suggests).

Thank you in advance for your cooperation and feedback!
