
Archive for June, 2015

PACE quarterly maintenance – July ’15

Posted on Wednesday, 24 June, 2015

Greetings!

The PACE team is again preparing for our quarterly maintenance, which will occur on Tuesday, July 21, and Wednesday, July 22. We’re approximately a month away, but I wanted to remind folks of our upcoming activities and give a preview of what we are planning.

  • Updated GPFS client – We are currently testing version 3.5.0-25 for deployment, as recommended by DDN.  Preliminary testing has shown it to have the fix for the problems encountered during our April maintenance.
  • “newrepo” becomes the default software repository – We will make the new PACE software repository (currently referred to as ‘newrepo’) the default. This means you will no longer need to switch to it explicitly using ‘module load newrepo’, and all modules will point to the new repository by default. The current repository will remain available as ‘oldrepo’ for as long as needed and can be accessed by loading a module, but all new software installations, upgrades, and fixes will go into newrepo.
  • Full reset of Infiniband fabric – We will reboot all of our Infiniband switches and subnet managers to ensure we have cleared out all of the gremlins from the Infiniband troubles earlier this month.
  • New storage devices for home directories and /usr/local – We’ve ordered some new storage servers to upgrade the aging servers that are currently providing home directories and /usr/local.  These new servers come in a high-availability configuration so as to better guard against equipment failures.  As a bonus item, we may begin the migration of our virtual machine backing storage to a separate new storage device.  Both of these items are contingent on the new equipment arriving in time to be installed and tested before the maintenance period.
  • New “data mover” servers – Also pending arrival and testing of new equipment, we will replace the “data mover” systems known as iw-dm3 and iw-dm4. These servers are intended for large data movement activities and will come with 40-gigabit Ethernet and Infiniband connectivity.

Infiniband problems in PACE

Posted on Tuesday, 16 June, 2015

PACE is experiencing problems after an Infiniband (IB) network failure, which affects MPI jobs as well as IB-connected storage, including GPFS (project space) and PanFS (scratch space). This problem may have caused jobs to crash or hang.

The Infiniband network has been restored, and we are now working to restore the storage mounts. We have also paused job submissions to prevent new jobs from starting. We will allow jobs to start again once the problems are completely resolved.

Thank you for your patience.
PACE team