
PACE clusters ready for research

Saturday, October 15, 2016

Our October 2016 maintenance period is now complete. We’ve upgraded compute nodes, login nodes and interactive (post-processing) nodes to the RedHat Enterprise Linux 6.7 load previously deployed on the TestFlight cluster.  This included a large number of bugfix and security patches, a major step forward in the Infiniband layer, and recompiled versions of various MPI libraries, scientific libraries and applications in /usr/local.  Please do let us know via email if you see issues with your jobs.
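
If you build your own codes, it may be worth recompiling against the refreshed MPI and library stacks and running a short test before resuming production work. A minimal sketch, assuming the usual environment-modules setup (the module versions below are placeholders only; check ‘module avail’ for what is actually installed):

    # Start from a clean environment and load a compiler and MPI stack
    module purge
    module load gcc/4.9.0 mvapich2/2.1   # placeholder versions

    # Rebuild and smoke-test an MPI code
    mpicc -O2 -o hello_mpi hello_mpi.c
    mpirun -np 4 ./hello_mpi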

We have brought compute nodes online and released previously submitted jobs. As usual, we have a number of compute nodes that still need to be brought back online, but we are actively working to make them available as soon as possible.

Hardware repairs to the project directory (~/data) system are complete.  Minor repairs to the scratch system will be rescheduled for a future maintenance period.  The issue is minor and should not disrupt the performance or availability of the scratch system.  No user action is expected.

The OIT Network Engineering team upgraded the software running on many of our switches, including our firewalls, to match the version running elsewhere on campus.  No user action is expected.

Electrical work
These problems were a bit more extensive than originally anticipated.  With some help from the OIT Operations team, we have an alternate solution in place, and will complete this work during a future maintenance period.  No user action is expected.

Bonus objectives
We were able to add capacity to the project directory system, and we now have our first single filesystem that’s greater than a petabyte, coming in at about 1.3PB.  Maybe that’ll last us a couple of weeks.  😉  Disks for the scratch system have been installed; we will add them into the scratch filesystem shortly.  This can be done live, without impact to running jobs.
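
For the curious, GPFS allows disks to be added to a mounted filesystem without interrupting clients, which is why this step can wait until normal operations resume. A rough admin-side sketch, assuming a stanza file describing the new NSDs (the filesystem label and file name below are illustrative only):

    # Create the new NSDs from a stanza file, then add them to the live filesystem
    mmcrnsd -F new_disks.stanza
    mmadddisk scratchfs -F new_disks.stanza

    # Optionally rebalance existing data across the new disks
    mmrestripefs scratchfs -b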

PACE quarterly maintenance – October 2016

Monday, October 10, 2016

Dear PACE users,

It is again time for our quarterly maintenance. Starting at 6:00am Thursday, October 13, all resources managed by PACE will be taken offline. Maintenance is scheduled to continue through Saturday evening. Our next maintenance period is scheduled for January 19-21, 2017.

As previously posted, our major activity this time around will be an upgrade of all compute nodes, head nodes and interactive nodes from RedHat 6.5 to RedHat 6.7. This simply deploys the operating system and software repository we have been testing internally and have made available on the TestFlight cluster so the broader PACE community can test as well. I’ll issue one more plea for additional testing, and ask that you report your experiences, either positive or negative.

We have some hardware repairs to perform on both DDN devices. Internal redundancy has prevented any availability or integrity problems, so we’ll take advantage of the downtime for DDN engineers to make repairs. No user action is expected.

The OIT Network Engineering team will be upgrading the software running on a number of PACE network devices. These updates are for stability and have been running on similar devices in the campus network for some time. No user action is expected.

Electrical work
OIT Operations will be replacing some faulty rack-level power distribution in a couple of racks. No user action is expected.

Bonus objectives
If sufficient time remains in the maintenance period, we will add capacity to both the GPFS scratch and project systems. We are able to perform these activities during normal operations, so we will prioritize bringing things back into operation as soon as possible.

Headnode GPFS Problem

Friday, September 30, 2016

At about 8:30pm this evening, one of the PACE systems that serves GPFS files to headnodes and other PACE internal systems failed. When this happens, users may see the message “stale file handle” or notice that there are no files under the /gpfs directory. This is a temporary condition that should be fixed shortly.
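
A quick way to tell whether the headnode you are on is affected is to check that /gpfs is mounted and listable, for example:

    # These should return promptly on a healthy headnode; a "Stale file handle"
    # error or an empty /gpfs indicates the condition described above
    df -h /gpfs
    ls /gpfs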

Please note: all files that were already written, and all files accessed or written by any compute node, are unaffected. However, if you were in the process of editing a file on a headnode, your most recent changes may be lost. In addition, any process you had running on a headnode that was using these files may have been killed by this failure.

To prevent this from recurring, PACE had ordered and very recently received a new computer to replace the system that failed this evening. Our staff will undertake the testing and replacement as soon as possible and we will post an announcement here once the new system is in service.

We apologize for this inconvenience and thank those users who let us know quickly.

TestFlight cluster available with RHEL6.7

Thursday, September 22, 2016

The TestFlight cluster is now available with the updated RHEL6.7 load, as well as recompiled versions of some software in /usr/local. Please do log in and try submitting your jobs to the ‘testflight’ queue. If you run into any problems, please send us a note.
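
If you have not used the testflight queue before, a job script only needs to point at it with the queue directive. A minimal sketch (the resource requests and module versions are illustrative; adjust them to match your usual production script):

    #PBS -N testflight-check
    #PBS -q testflight
    #PBS -l nodes=1:ppn=4
    #PBS -l walltime=00:30:00

    cd $PBS_O_WORKDIR
    module load gcc/4.9.0 mvapich2/2.1   # placeholders; load what your code normally uses
    mpirun -np 4 ./my_app

Submit it with qsub as usual, and compare the results with a run from your production queue.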

PACE quarterly maintenance – October 2016

Tuesday, September 13, 2016

Dear PACE users,

Quarterly maintenance is fast approaching. Starting at 6:00am on Thursday, October 13, all resources managed by PACE will be taken offline. Maintenance will continue through Saturday evening unless we are able to finish sooner.

Our major activity this maintenance period will be an operating system upgrade for all compute nodes, head nodes and interactive nodes. This update will take us from RedHat 6.5 to RedHat 6.7, and includes important security and bug fix updates to the operating system, a new Infiniband layer and some recompiled versions of existing /usr/local software. Some applications have shown increased performance as well.


PACE staff have been testing this upgrade using various existing applications, but we need your help to ensure a smooth rollout. As of today, we have begun applying these updates to our TestFlight cluster, which is available for all to use. We’ll send out a follow-up communication when it is ready. PLEASE, PLEASE, PLEASE use the next few weeks to try your codes on the TestFlight cluster and send us your feedback. We would especially like to hear of any issues you may have, but reports of working applications would be helpful as well.

Our goal is to provide the best possible conversion to the updated operating system, and we ask that you please take the opportunity to help us ensure a smooth transition back into normal operation by availing yourself of the TestFlight cluster.

localized network outage has some nodes offline

Thursday, August 25, 2016

At approximately 10:40 this morning, a top-of-rack network switch in the P31 rack of our data center failed. This caused a loss of network connectivity for approximately 44 compute nodes across a wide variety of queues (see below). No other compute nodes are affected. Jobs running on these nodes will likely have failed as a result. The OIT network team is swapping in a replacement at the moment, and PACE staff are working to restore service as quickly as possible.

If you have access to any of the queues below, please check on their status and resubmit as needed. You can check which queues you have access to by using the ‘pace-whoami’ command.
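
For example, from a login node you might do something like the following (the queue and script names are illustrative):

    # List the queues your account can submit to
    pace-whoami

    # Check the state of your current jobs; anything that was running on the
    # failed rack will no longer show as running
    qstat -u $USER

    # Resubmit an affected job to one of your queues
    qsub -q myqueue my_job.pbs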

We apologize for the inconvenience, and will work to bring these nodes back online as soon as possible.  If you have additional questions, please email us.


resolved: storage problems this morning

Monday, August 1, 2016

We look to be back up at this point.  The root cause seems to have been a problem with the subnet manager that controls the Infiniband network.  Since GPFS uses this network, the issue initially manifested as a storage problem.  However, many MPI codes use this network as well and may have crashed.

Again, we apologize for the inconvenience.  Please do check on your jobs if you use MPI.

storage problems this morning

Monday, August 1, 2016

Happy Monday!

Since about 2:30am this morning, we have been experiencing a GPFS problem and, while all data is safe, all GPFS services are currently unavailable.  This includes the scratch space and, for many users, the project directory (~/data) filesystems.  We are working on restoring service as quickly as possible and apologize for the inconvenience.

Some PACE accounts temporarily disabled

Friday, July 22, 2016

Some of our users will not be able to log in to PACE clusters after the maintenance day due to the ongoing scratch data migration process. Here are the details:

One of the tasks performed during this maintenance day was to deploy a new fileserver to serve the “scratch” storage.

The new system is now operational; however, the scratch data for some accounts is still being migrated. We have temporarily deactivated these accounts to prevent modifications to incomplete data sets. Jobs submitted by these users are also being held by the scheduler. Once the transfer is complete for a user, we will enable the account, release the user’s jobs, and send a notification email. Currently we have no way to estimate how long the migration will take, but we are doing all we can to make this process go as quickly as possible.

If you are one of these users and need any of your data urgently, please contact us. All of your data are intact and accessible by us, and we will try to find a solution for delivering the data you need for your research.

PACE clusters ready for research

Friday, July 22, 2016

Our July 2016 maintenance is now substantially complete.  Again, we sincerely apologize for the unfortunate additional unplanned downtime.

As previously communicated, we’ve had an unexpected delay caused by the data migration from the old scratch system to the newly acquired system. Some of these transfers are still in progress, with a limited number of users still affected. We have temporarily disabled access for these users to prevent jobs from running on incomplete scratch data, and we are reaching out to the affected users individually with more details. These users will not be able to log in, and their previously submitted jobs will not run, until their scratch migration is complete. If you have not received a further notification from us and experience problems with logins or anything else, please do let us know as soon as possible by email.

Scratch performance may be reduced as these migrations complete, and we are doing everything we can to finish these migrations as soon as possible.

We have brought compute nodes online and released previously submitted jobs. As usual, we have a number of compute nodes that still need to be brought back online, but we are actively working to make them available as soon as possible.


The new DDN SFA-7700 system is now operational and serving scratch storage for all users. We updated client software versions on all nodes.  We have encountered an anomaly that reduces the system’s internal redundancy but does not affect normal operation.  We expect to be able to rectify this while in production.

Electrical work

Tasks were completed as described.

Bonus objectives

Network and local storage upgrades were implemented on the schedulers as planned.  Additional diskless nodes were converted to diskful as planned.