Author Archive

PACE Procurement Timeline Adjustments

Posted by on Friday, 29 March, 2019

PACE staff have completed our move to the CODA building and are settling in. We’ve also added a couple of new faces to the team; announcements will be forthcoming shortly.

As the year-end purchasing deadlines approach, we wanted to update the community on some changes to our procurement calendar. We’re doing our best to advocate for the research community and navigate some tough realities. We’ve nearly exhausted our space in the Rich Computer Center, and are very limited in our ability to deploy new equipment in that space. The CODA datacenter will be our new home (more on that below) but is not quite ready yet.

As such, we have cancelled the previously planned FY19-Phase3 and will need to shift some dates for our last order in FY19, FY19-Phase4. This shift results in FY19-Phase4 and FY20-Phase1 essentially being deployed concurrently around October of 2019. For this reason, we strongly encourage faculty to participate in FY20-Phase1 and reserve FY19-Phase4 for those who need to use funds expiring in FY19.

We will also adjust configurations and pricing for FY19-Phase4 and FY20-Phase1 based on upcoming processing technology and market conditions once that pricing is available to the public.

Finally, planning is in progress for PACE to migrate existing research cyberinfrastructure from the Rich data center to CODA, and all efforts will be made to minimize disruption to research efforts during this move. The execution phase will not begin until at least October 2019.

To view the published schedule online or for more information, visit or email

Best Regards,

-PACE Team

PACE clusters ready for research

Posted by on Saturday, 16 February, 2019
Our February 2019 maintenance was completed on schedule. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.
Please let us know of any problems you may notice. The following tasks were completed:

* (COMPLETE) Vendor will replace defective components on groups of servers

* (COMPLETE) Ethernet network reconfiguration

* (COMPLETE) GPFS / DDN enclosure reset

* (COMPLETE) NAS maintenance and reconfiguration

* (COMPLETE) PACE VMware reconfiguration to remove out-of-support hosts

* (COMPLETE) Migration of Megatron cluster to RHEL7

PACE quarterly maintenance – (Feb 15-16, 2019)

Posted by on Friday, 18 January, 2019

[Update – 02/11/2019] Our updated quarterly scheduled maintenance task list will include the following:


  • (no user action needed) Vendor will replace defective components on groups of servers
  • (no user action needed) Ethernet network reconfiguration
  • (no user action needed) GPFS / DDN enclosure reset
  • (no user action needed) NAS maintenance and reconfiguration
  • (no user action needed) PACE VMware reconfiguration to remove out-of-support hosts


[Original Post – 01/18/2019] We are preparing for a short maintenance day on February 15, 2019. Unlike our regular schedule, which starts on Thursdays and takes three days, this maintenance will start on a Friday and take only two days.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

In general, we’ll perform maintenance on the GPFS storage, migrate some virtual machines to new servers, perform hardware changes on one of the clusters, and finalize the migration of “/usr/local”, which is a network-attached mount point on all machines, to a more reliable storage pool.

While we are still working on finalizing the task list and details, none of these tasks are expected to require any user actions.

We’ll update this post as we have more details.



Changes to mount points (no user impact expected)

Posted by on Thursday, 3 January, 2019

The investigation that followed the system failures that temporarily rendered the scientific repository unresponsive showed that some additional maintenance is required. To facilitate this maintenance, we will make a change to the mount point for /usr/local, which is network mounted and identical on all compute nodes.

Our tests indicate that this swap can be performed live, without impacting running jobs. It’s also completely transparent to users; you don’t need to change or do anything as a result.

In the unlikely event of job crashes that you suspect are caused by this operation, please contact us and we’ll be happy to assist.

Thank you,

[Resolved] Widespread problems impacting all PACE machines

Posted by on Tuesday, 11 December, 2018

Update (12/21, 10:15am): A correction: the problems started this morning around 8:15am, not yesterday evening as previously communicated. The systems were back online at 8:45am.

Update (12/21, 9:15am): Another incident started last night, causing the same symptoms (hanging and unavailability of the scientific repository). OIT storage engineers reverted the services on the redundant system (high availability pair) and the storage is available again. We continue to investigate the root cause of the recurring failures experienced over the past several weeks.

Update (12/12, 6:30pm): The services have been successfully migrated to the high availability pair and the filesystems are once again accessible. We’ll continue to monitor the systems and take a close look at the errant components. It’s still possible that some of these problems will recur, but we’ll be ready to address them should they happen.

Update (12/12, 5:30pm): Unfortunately the problems seem to be coming back. We continue to work on this. Thank you for your patience.

Update (12/12, 11:30am): We identified the root cause as a configuration conflict between two devices and resolved the problem. All systems are back online and available for jobs.

Update (12/12, 10:00am): Our battle with the storage system continues. This filesystem is designed as a high availability service with redundant components to prevent such situations, but unfortunately the second system failed to take over successfully. We are investigating the possibility of the network being the culprit. We continue to work rigorously to bring the systems back online ASAP.

Update (12/11, 9:00pm): The problems continue; we are working on them with support from related OIT units.

Update (12/11, 7:30pm): We mitigated the issue, but the intermittent problems may continue to recur until the root cause is addressed. We continue to work on it.

Original message:

Dear PACE Users,

At around 3:45pm on Dec 11, the fileserver that serves the shared “/usr/local” on all PACE machines started experiencing problems. This issue is causing several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

We are working to resolve these problems ASAP and will keep you updated on this post.
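
For users who want to check from a script whether a network-mounted path such as “/usr/local” is answering, the following is a minimal sketch and not PACE tooling. The helper name `path_responds` and the default timeout are hypothetical. It runs the check in a child process because a file call against a hung mount can block the caller indefinitely.

```python
import subprocess
import sys


def path_responds(path: str, timeout: float = 5.0) -> bool:
    """Return True if `path` exists and can be stat'ed within `timeout` seconds.

    The stat runs in a child process so that a hung mount blocks the
    child rather than the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the stat call succeeded in time.
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Ask the child to stop, but do not wait for it: a process stuck
        # on an unresponsive mount may not exit until the mount recovers.
        child.kill()
        return False


if __name__ == "__main__":
    print(path_responds("/usr/local"))
```

A `False` result only says that the path did not answer in time or does not exist; it does not identify the cause.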





[RESOLVED] Temporary unavailability of home directories

Posted by on Wednesday, 19 September, 2018

At around 6:10pm on Sep 19, 2018, the storage servers that export PACE home directories and the software repository experienced a problem. We identified and resolved the issue within 15 minutes of the event.

This problem caused temporary unavailability of home directories and applications. The symptoms include hanging commands, codes and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we can’t be sure. Please check your jobs for crashes and report any problems you notice to us.



Testflight queue transition and unavailability

Posted by on Wednesday, 12 September, 2018

As you know, the testflight queue includes nodes that are reserved for testing the systems/services that are planned to be deployed in the future.

As part of our preparations to transition to the next OS (RHEL7), we will take this queue offline, swap its nodes with newly purchased nodes (that better represent the modern systems currently in use), and finally deploy RHEL7 on these new nodes.

Once these preparations are complete, we’ll reach out to you and ask you to test your codes. Until then, testflight will not be available and submissions will be declined.

There are currently some jobs running on this queue. We’ll wait until the current jobs complete instead of killing them, but we would like to once again emphasize that the use of testflight for production is against policy. This queue should only be used for testing purposes.

Please let us know if you have any questions.

[Resolved] File locking issues causing hanging in codes and login troubles

Posted by on Thursday, 6 September, 2018

If you have been observing mysteriously hanging codes, or trouble logging in on headnodes, please read on!

We started receiving reports of hanging processes, mostly for GPU codes. In addition, users whose default shell is tcsh/csh had difficulties logging into nodes.

Upon further investigation, we found that a storage problem was affecting the file locking mechanism on home directories (where most applications keep their configuration files, regardless of where they run).

This problem was very subtle, as it was impacting only a small number of processes and data operations appeared to be working well otherwise.

We addressed this issue this morning (9/6, 10am) and you should no longer see hanging codes. Please report any ongoing issues to us.
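
If you want to confirm that file locking works in your home directory, a small probe along the following lines may help. This is a minimal sketch under stated assumptions, not an official diagnostic: the function name `can_lock` and the probe file name are hypothetical, and it uses Python's standard `fcntl.lockf` in non-blocking mode. It cannot detect a lock call that blocks inside the kernel despite the non-blocking flag.

```python
import fcntl
import os
import time


def can_lock(path: str, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Try to take and release an exclusive advisory lock on `path`.

    Returns True if the lock was acquired within `timeout` seconds,
    False if it stayed unavailable for the whole period.
    """
    deadline = time.monotonic() + timeout
    # Create the file if needed; it is left in place afterwards.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        while True:
            try:
                # LOCK_NB makes the call fail immediately instead of blocking.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                if time.monotonic() >= deadline:
                    return False
                time.sleep(poll)
            else:
                fcntl.lockf(fd, fcntl.LOCK_UN)
                return True
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical probe file in the home directory.
    print(can_lock(os.path.expanduser("~/.lock_probe")))
```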

[RESOLVED] Scratch storage problems

Posted by on Tuesday, 14 August, 2018
Update (08/15/2018): As suspected, internal data migrations were not happening automatically. We worked with the vendor to address the issue and it’s now once again safe to use the scratch storage. We’ll keep monitoring the utilization just in case.

Original post:

We have received multiple reports of jobs crashing due to insufficient scratch storage, although the physical usage is only at 38%.

We suspect that this issue is related to some disk pools that are not able to migrate data to other pools internally.

We are currently looking into this problem. In the meantime, we recommend not using the scratch space if possible, until we have a better understanding of the situation.

Thank you, and sorry for this inconvenience.
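
To see how jobs can run out of space while overall usage looks low, consider the sketch below. It is an illustration only: the pool names and sizes are hypothetical numbers chosen so that the total comes out at 38%, and it assumes that new writes land in one pool and rely on internal migration to drain it, which is not a description of the actual storage configuration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity_tb: float
    used_tb: float


def overall_usage(pools):
    """Fraction of total capacity in use across all pools."""
    return sum(p.used_tb for p in pools) / sum(p.capacity_tb for p in pools)


def full_pools(pools, threshold=0.99):
    """Names of pools at or above `threshold` utilization."""
    return [p.name for p in pools if p.used_tb / p.capacity_tb >= threshold]


# Hypothetical numbers: one small pool is full, one large pool is mostly empty.
pools = [
    Pool("landing", capacity_tb=100, used_tb=100),
    Pool("bulk", capacity_tb=400, used_tb=90),
]

if __name__ == "__main__":
    print(f"overall usage: {overall_usage(pools):.0%}")
    print("full pools:", full_pools(pools))
```

With these numbers the aggregate is 190 of 500 TB, yet any write that must go to the full pool fails until data is migrated out of it.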

[COMPLETE] PACE quarterly maintenance – (Aug 9-11, 2018)

Posted by on Monday, 30 July, 2018

Update (Aug 10, 2018, 8:00pm): Our Aug 2018 maintenance is complete, one day ahead of schedule. All of the tasks were completed as planned. We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. As usual, there are a small number of straggling nodes we will address over the coming days.

Please note the important changes regarding decommissioned login nodes, including the commonly used force-6 headnode.

Our next maintenance period is scheduled for Thursday, Nov 1 through Saturday, Nov 3, 2018.

Original message:

The next PACE maintenance will start on 8/9 (Thu) and may take up to 3 days to complete, as scheduled.

As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete. You can reduce the walltime of such jobs to ensure completion before 6am on 8/9 and resubmit if this will give them enough time to complete successfully.
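
The walltime rule above comes down to simple date arithmetic. The following is a minimal sketch, not the scheduler's actual logic: the function names and the example start time are hypothetical, and only the 6am, 8/9 cutoff is taken from this announcement.

```python
from datetime import datetime, timedelta

# Maintenance start taken from the announcement: 6am on Aug 9, 2018.
MAINTENANCE_START = datetime(2018, 8, 9, 6, 0)


def max_walltime_before(start: datetime,
                        cutoff: datetime = MAINTENANCE_START) -> timedelta:
    """Longest walltime a job starting at `start` can request and still
    finish by `cutoff` (zero if `start` is at or past the cutoff)."""
    return max(cutoff - start, timedelta(0))


def fits_before_maintenance(start: datetime, walltime: timedelta,
                            cutoff: datetime = MAINTENANCE_START) -> bool:
    """True if a job starting at `start` with the requested `walltime`
    would end at or before `cutoff`."""
    return start + walltime <= cutoff


if __name__ == "__main__":
    # Hypothetical job that would start at noon on Aug 7, 2018.
    start = datetime(2018, 8, 7, 12, 0)
    print(max_walltime_before(start))
    print(fits_before_maintenance(start, timedelta(hours=48)))
```

In this example a 48-hour request would overlap the maintenance window, while a request of 42 hours or less would not.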

Planned Tasks


  • (some user action needed) Most PACE headnodes (login nodes) are currently Virtual Machines (VM) with slow response time and sub-optimal storage performance, which are often the cause of slowness.

We are in the process of replacing these VMs with more capable physical servers. After the maintenance day, login attempts to these VMs will be rejected with a message that tells you which hostname to use instead. In addition, we are sending each user a customized email with a list of old and new login nodes. Please don’t forget to configure your SSH clients to use these new hostnames.

Simply, “” will be used for all shared clusters and “” will be for dedicated clusters. Once you log in, you’ll be redirected automatically to one of several physical nodes (e.g. login-s1, login-d2, …) depending on their current load.

There will be no changes to clusters that already come with a dedicated (and physical) login node (e.g. gryphon, asdl, ligo, etc.).

  • (some user action needed) As some users have already noticed, users can no longer edit cron jobs (e.g. crontab -e) on the headnodes. This is on purpose, because access to the new login nodes (login-d and login-s) is dynamically routed to different servers depending on their load. This means you may not be able to see the cron jobs you have installed the next time you log in to one of these nodes. For this reason, only PACE admins can install cron jobs on behalf of users to ensure consistency (only login-d1 and login-s1 will be used for cron jobs). If you need to add (or edit) cron jobs, please contact us. If you already have user cron jobs set up on one of the decommissioned VMs, they will be moved over to login-d1 or login-s1 during the maintenance so they’ll continue to run.


  • (no user action needed) Add a dedicated protocol node to the GPFS system to increase capacity and improve response time for non-InfiniBand connected systems. This system will gradually replace the IB gateway systems that are currently in operation.
  • (no user action needed) Replace batteries in DDN/GPFS storage controllers
  • (no user action needed) Upgrades to the DNS appliances in both PACE datacenters
  • (no user action needed) Add redundant storage links to specific clusters
  • (no user action needed) Perform network upgrades
  • (no user action needed) Replace devices that are out of support