
PACE quarterly maintenance – (Nov 2-4, 2017)

Monday, October 23, 2017


Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Nov 2 through the end of Saturday (Nov 4). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Some of the planned improvements, storage migrations in particular, require the attention of a large number of users. Please read on for more details and action items.

Storage (Requires user action)

PACE is retiring old NFS storage servers, which have been actively serving project directories for a large number of users. All of the data they contain will be consolidated onto a recently purchased GPFS storage system (pace2). GPFS is a high-performance parallel filesystem that offers improved reliability (and, in many cases, performance) compared to NFS.

Important: PACE will also start enforcing a limit of 2 million files/directories on this GPFS system, regardless of their size. We have identified the users who currently exceed this limit and will contact them separately to prevent interruptions to research.
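To get a rough idea of where a directory stands relative to this limit, its entries can be counted with find. The following is only a sketch: the function names count_entries and check_limit are made up for illustration, and it assumes that every file, directory and symlink counts once toward the limit, which may not match exactly how the GPFS system does its accounting.

```shell
#!/bin/bash
# Hypothetical helpers, not PACE-provided tools.

# Print the number of entries (files, directories, symlinks) under a
# directory, counting the directory itself. -H makes find follow a symlink
# given on the command line (such as ~/data) but not symlinks found below it.
# Entries are NUL-separated so that unusual file names cannot skew the count.
count_entries() {
    find -H "$1" -print0 | tr -cd '\0' | wc -c | tr -d ' '
}

# Compare the count with a limit (default: the 2 million announced above).
# Returns 0 when within the limit and 1 when over it.
check_limit() {
    local dir=$1 limit=${2:-2000000} n
    n=$(count_entries "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "$dir: $n entries, over the limit of $limit"
        return 1
    fi
    echo "$dir: $n entries, within the limit of $limit"
}

# Example: check_limit ~/data
```

Walking a large directory tree is itself a heavy file operation, so a count like this is better run from a compute node than from a headnode.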

Here’s a full list of storage locations that will be migrated to ‘pace2’:

pg1, pc5, pe11, pe14, pe15, pe3, pe5, pe9, pe10, pe12, pe4, pe6, pe8, pa1, pbi1, pcc1, pcee2, pmse1, psur1, pc4, pase1, pmart1, pchpro1, pska1, pbiobot1, pf2, pggate1, ptml1, pc6, py1, py2, pc2, pz2, pe1, pe7, pe13, pe2, pb2, pface1, pas1, pf1, pb1, hp3, pj1, pb3, pc1, pz1, ps1, pec1, pma1

In addition to these NFS shares, we will also migrate these two filesystems from our current GPFS system (pace1) to the new GPFS system (pace2), due to limited space availability:

bio-konstantinidis, bio-soojinyi


How can I tell if my project directory will be migrated?

Copy and run this command on the headnode:

find ~/data* -maxdepth 1 -type l -exec ls -ld {} \;

This command will return one or more lines similar to:

lrwxrwxrwx 1 root pace-admins 16 Jun 16 2015 /nv/hp16/username3/data -> /nv/pf2/username3
lrwxrwxrwx 1 root pace-admins 19 Jan 6 2017 /nv/hp16/username3/data2 -> /gpfs/pace1/project/pf1/username3

Please note the right-hand side of the arrow “->”. If the arrow points to a path starting with “/nv/…” followed by a storage name included in the list provided above, then your data will be migrated. In this example, the location linked as “data” will be migrated (/nv/pf2/username3) but “data2” will not (/gpfs/pace1/project/pf1/username3).

As an exception, all references to “bio-konstantinidis” and “bio-soojinyi” will be migrated, even though their paths start with “/gpfs” and not “/nv”. E.g.:

lrwxrwxrwx 1 root bio-konstantinidis 43 Oct 11 2015 /nv/hp1/username3/data3 -> /gpfs/pace1/project/bio-konstantinidis/username3
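The rule described above can also be written down as a small shell function, which may help when checking many links at once. This is a sketch under stated assumptions rather than an official PACE tool: the function name will_migrate and the variable MIGRATING_NFS are illustrative, and the storage names are simply copied from the list in this post.

```shell
#!/bin/bash
# Storage names copied from the migration list in this post.
MIGRATING_NFS="pg1 pc5 pe11 pe14 pe15 pe3 pe5 pe9 pe10 pe12 pe4 pe6 pe8"
MIGRATING_NFS="$MIGRATING_NFS pa1 pbi1 pcc1 pcee2 pmse1 psur1 pc4 pase1"
MIGRATING_NFS="$MIGRATING_NFS pmart1 pchpro1 pska1 pbiobot1 pf2 pggate1"
MIGRATING_NFS="$MIGRATING_NFS ptml1 pc6 py1 py2 pc2 pz2 pe1 pe7 pe13 pe2"
MIGRATING_NFS="$MIGRATING_NFS pb2 pface1 pas1 pf1 pb1 hp3 pj1 pb3 pc1 pz1"
MIGRATING_NFS="$MIGRATING_NFS ps1 pec1 pma1"

# Hypothetical helper: print "migrate" or "stay" for a symlink target path.
will_migrate() {
    local target=$1 name
    case $target in
        # The two GPFS filesystems named as exceptions.
        /gpfs/pace1/project/bio-konstantinidis/*|/gpfs/pace1/project/bio-soojinyi/*)
            echo migrate
            return ;;
        /nv/*)
            name=${target#/nv/}   # drop the leading "/nv/"
            name=${name%%/*}      # keep only the storage name
            case " $MIGRATING_NFS " in
                *" $name "*)
                    echo migrate
                    return ;;
            esac ;;
    esac
    echo stay
}

# Example: classify every data* symlink in the home directory.
# for link in ~/data*; do
#     [ -L "$link" ] && echo "$link: $(will_migrate "$(readlink "$link")")"
# done
```

Matching on the whole storage name, rather than on a prefix, keeps names such as pe1 and pe11 apart.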

What do I do if my project storage is being migrated?

No action is needed for users who have been using the symbolic link names to access the storage (e.g. data, data2, etc.), because PACE will update these links to point to the new locations.

If you have been referencing your storage by its absolute path (e.g. /nv/pf2/username), which is not recommended, then you need to replace all mentions of “/nv” with “/gpfs/pace1/project” in your codes, scripts and job submissions. E.g., “/nv/pf2/username3” should be replaced with “/gpfs/pace1/project/pf2/username3”.

Users of bio-konstantinidis and bio-soojinyi should only need to replace “pace1” with “pace2”. E.g., “/gpfs/pace1/project/bio-konstantinidis/username3” should be replaced with “/gpfs/pace2/project/bio-konstantinidis/username3”.
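For scripts with many hard-coded paths, the two replacements described above can be applied with sed. The sketch below is an illustration, not a PACE-supported tool: rewrite_paths, MIGRATING_NFS and NFS_DEST are made-up names, and the destination prefix is taken as stated in this post, so please confirm it with PACE before relying on it. The filter only rewrites “/nv/” paths whose storage name is in the migration list, so home directory paths such as /nv/hp16/… are left alone.

```shell
#!/bin/bash
# Storage names copied from the migration list in this post.
MIGRATING_NFS="pg1 pc5 pe11 pe14 pe15 pe3 pe5 pe9 pe10 pe12 pe4 pe6 pe8"
MIGRATING_NFS="$MIGRATING_NFS pa1 pbi1 pcc1 pcee2 pmse1 psur1 pc4 pase1"
MIGRATING_NFS="$MIGRATING_NFS pmart1 pchpro1 pska1 pbiobot1 pf2 pggate1"
MIGRATING_NFS="$MIGRATING_NFS ptml1 pc6 py1 py2 pc2 pz2 pe1 pe7 pe13 pe2"
MIGRATING_NFS="$MIGRATING_NFS pb2 pface1 pas1 pf1 pb1 hp3 pj1 pb3 pc1 pz1"
MIGRATING_NFS="$MIGRATING_NFS ps1 pec1 pma1"

# Destination prefix for former /nv paths, as stated in this post.
NFS_DEST="/gpfs/pace1/project"

# Hypothetical filter: read a script on stdin, write the rewritten script
# to stdout. Nothing is modified in place.
rewrite_paths() {
    local names boundary
    # Turn the space-separated list into a regex alternation: pg1|pc5|...
    names=$(printf '%s' "$MIGRATING_NFS" | tr -s ' ' '|')
    # A storage name must be followed by a non-name character or end of line,
    # so that pe1 does not match inside pe11.
    boundary='([^[:alnum:]_-]|$)'
    sed -E \
        -e "s#/nv/($names)$boundary#$NFS_DEST/\\1\\2#g" \
        -e "s#/gpfs/pace1/project/(bio-konstantinidis|bio-soojinyi)$boundary#/gpfs/pace2/project/\\1\\2#g"
}

# Example: rewrite_paths < job.pbs > job.pbs.new && diff job.pbs job.pbs.new
```

Writing to a new file and reviewing the diff before replacing the original is a deliberate choice here, since an automated rewrite can touch paths inside comments or strings that were not meant to change.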

NOTE: PACE strongly encourages all users to reference their project directories using their symbolic links (e.g. data, data2, …), rather than absolute paths, which are always subject to change. Doing so will minimize the user action needed when we make changes in the systems and configurations.

What if I don’t fix existing references to the old locations after my data are migrated?

The PACE team will replace existing directories with links pointing to their new locations to minimize user impact. This way, scripts/codes that point to old paths can continue to run without needing any changes. However, this temporary failsafe measure will only be in place for approximately 3 more months (until the next maintenance day). We strongly encourage all users to check whether their data is being migrated, then fix their scripts/codes as needed within this 3-month grace period. Please contact the PACE team if you need any assistance with this process.

PACE team will also be monitoring jobs during this time period, and proactively reach out to users with jobs that are still using the old paths.


Schedulers (Requires some user action)

  • Consolidation of nvidia-gpu and gpu-recent queues into a new queue named “force-gpu”: This will require users of these queues to change the queue name to “force-gpu” in their submission scripts.
  • Clean up and improve PBSTools configurations and data
  • Migration of semap-6 queue to the dedicated-sched scheduler
  • [NEW] Migration of all joe queues on the shared-sched scheduler
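For the queue consolidation, the change in a submission script usually comes down to a single line. The sketch below assumes that the queue is selected with a “#PBS -q” directive inside the script; rename_gpu_queue is a made-up name, and queues passed on the command line (e.g. with qsub -q) are not covered.

```shell
#!/bin/bash
# Hypothetical filter: read a submission script on stdin and write it to
# stdout with the retired GPU queue names replaced by force-gpu.
# Only "#PBS -q <queue>" directive lines are touched.
rename_gpu_queue() {
    sed -E 's/^([[:space:]]*#PBS[[:space:]]+-q[[:space:]]+)(nvidia-gpu|gpu-recent)([[:space:]]|$)/\1force-gpu\3/'
}

# Example: rename_gpu_queue < job.pbs > job.pbs.new
```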

ASDL / ITAR cluster (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Redistribute power connections to additional circuit in the rack
  • Replace CMOS batteries on compute nodes
  • Replace motherboard on the file server, to use all the available memory slots

Power and network (no user action needed)

These planned maintenance tasks are completely transparent to users:

  • Update power distribution units on 2 racks
  • Move compute nodes to balance power utilization
  • Replace old, out of support switches
  • Update DNS appliances in Rich 116, 133 and BDCD
  • Increase redundancy of Infiniband connections between Rich 116 and 133

Campus preparedness and Hurricane Irma

Friday, September 8, 2017

Greetings PACE community,

As Hurricane Irma makes its way along the projected path through Florida and into Georgia, I’d like to let you know what PACE is doing to prepare.

OIT Operations will be closely monitoring the path of the storm and any impacts it might have on the functionality of computer rooms in the Rich Computer Center and our backup facility on Marietta Street. In the event that either of these facilities loses power, they will enact emergency procedures and react as best as possible.

What does this mean for PACE?

The room where we keep the compute nodes only has a few minutes of battery-protected power. This is plenty to ride through momentary glitches in power, but not a longer outage. In the event of a power loss, compute nodes will power down and terminate whatever jobs are running. The rooms where we keep our servers, storage and backups have additional generator power which can keep them running longer. This too is a finite resource. In the event of power loss, PACE will begin an orderly shutdown of servers and storage in order to reduce the chance of data corruption or loss.

Bottom line is that our priority will be to protect the critical research data, and enable successful resumption of research once power is restored.

Where to get further updates?

Our primary communications channels remain our mailing list and the PACE blog. However, substantial portions of the IT infrastructure required for these to operate are also located in campus data centers. Additionally, OIT employs a cloud-based service to publish status updates. In the event that our blog is unreachable, please check that service for updates.

GPFS problem (resolved)

Saturday, September 2, 2017

This was much ado about nothing.  Running jobs continued to execute normally through this event, and no data was at risk.  What did happen is that jobs that could potentially have started were delayed.

A longer explanation –

We have monitoring agents that prevent jobs from starting if they detect a potential problem with the system.  The idea is to avoid starting a job if there’s a known reason that would cause a crash.  During our last maintenance period, we brought a new DDN storage system online and configured these agents to watch it for issues.  It did develop an issue, and the monitoring agents flagged it and stopped nodes from accepting new jobs.  However, we have yet to put any production workloads on this new storage, so no running jobs were affected.

At the moment, we’re pushing out a change to the monitoring agents to ignore the new storage.  As this finishes rolling out, compute nodes will come online and resume normal processing.  We’re also working with DDN to address the issue on the new storage system.

GPFS Problem

Friday, September 1, 2017

We are actively debugging a GPFS storage problem on our systems that unfortunately brought many queues offline. We do not yet fully know the cause and solution, but will update as soon as possible.

We apologize for the inconvenience and are actively working on a solution.

Please Use iw-dm-4 for File Transfers

Tuesday, August 29, 2017

We experienced a slowdown yesterday on all headnodes caused by an unusually large amount of user file operations from headnodes. All headnodes are virtual machines and connect through a network file system gateway to the GPFS filesystem. This gateway became overwhelmed by the user file operations, and subsequently slowed down all headnode file operations.

If you have heavy file operations (e.g. WinSCP, FileZilla, SCP), please perform these by logging directly into iw-dm-4 instead of a headnode. Additionally, other file operations such as tarring/zipping are best performed on iw-dm-4 or on compute nodes, by submitting interactive or batch jobs.

We’re actively looking into alternatives to virtual machine headnodes, and will provide more detailed updates as we approach our upcoming scheduled maintenance in November. If you have any questions, please contact the PACE team.

Resolved: Apparent CSH and TCSH Problems

Monday, August 14, 2017

We’ve addressed some of the problems with the TrueNAS storage and CSH/TCSH should now be working again. As it turns out, this problem wasn’t actually related to the maintenance last week, and we will continue to work with the vendor regarding the cause.

Apparent CSH and TCSH Problems

Monday, August 14, 2017

We’ve observed a correlation between hanging processes on all PACE systems and csh/tcsh, and are continuing to investigate. For the time being, we ask that you please refrain from running commands related to csh or tcsh. The problem appears to be related to the TrueNAS storage system. We’re currently working with OIT and iXsystems to resolve the issue.

[RESOLVED] Post-maintenance Storage Issues

Saturday, August 12, 2017

Update: GPFS storage is stable again. There remain several steps we need to take to complete this work, which can be completed without downtime. We may need to take some nodes temporarily offline, which will be done in coordination with you and without impacting running jobs.

PACE systems started experiencing widespread problems with the GPFS storage shortly after we released jobs following the completion of the maintenance tasks. At first glance, they seem to be related to the Infiniband network.

We would like to advise against submitting new jobs until these issues are fully resolved. We will continue to work on a resolution and keep you updated on the progress.

Thank you for your patience.

PACE systems experiencing problems

Wednesday, August 9, 2017
Update: The immediate issue is now resolved. We continue to work with the vendor to find out more about what caused the problem. If you need assistance with failed jobs or unavailable headnodes, please contact the PACE team.
We are experiencing a storage issue that’s impacting the majority of the home directories and some services, with potential impact on running jobs. We are currently in the process of assessing the overall impact and investigating the root cause. We’ll update our blog as we learn more about the failures.
We are sorry for the inconvenience and thank you for your patience.
Update: Impacted filesystems:

PACE quarterly maintenance – (Aug 10-12, 2017)

Thursday, August 3, 2017

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Aug 10 through the end of Saturday (Aug 12). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Storage (no user action needed)

We are working with the vendor to reconfigure our multiple GPFS filesystems for performance fine-tuning and better support for large scale deployments. Some of these configurations will be applied on the maintenance day because they require downtime.

Network (no user action needed)

The PACE Infiniband (IB) network requires some configuration changes to handle the rapid growth in the number of PACE nodes. We have identified several configuration parameters that will potentially reduce the occurrence of nodes losing their connectivity to GPFS (which relies on the IB network), a failure that causes intermittent job crashes.

Schedulers (postponed, no user action needed)

We have communicated our plans to upgrade the scheduler on several occasions, but skipped this task on past maintenance days due to bugs that we had uncovered. Despite promising progress by the vendor on resolving these bugs, they are not yet fully resolved and tested. For this reason, we decided to once again postpone the upgrade and keep the current versions until we have a bug-free and well-tested version.

Scheduler-based monitoring and analysis (no user action needed)

PACE research scientists started a collaboration with the Texas Advanced Computing Center (TACC) to develop a new tool that analyzes scheduler logs to gain insights into usage trends.

This tool relies heavily on a widely used utility named ‘PBSTools’, which is developed by the Ohio Supercomputer Center. Our current installation of PBSTools is old, buggy and very slow. We will upgrade this tool and its database on the maintenance day to ensure that no job info is lost during the transition.

Power Work  (no user action needed)

PACE and the OIT Operations team will perform some work in the datacenter to improve electrical capacity and balance. This includes moving and/or replacing some of the in-rack Power Distribution Units (PDUs).

Software Repository (no user action needed)

Our NIST SP800-171 conforming secure cluster has a specific need for a dedicated copy of the PACE software repository. We have previously completed an initial replication of the files and will simply point the nodes to the replica during the maintenance window.