PACE: A Partnership for an Advanced Computing Environment

April 21, 2016

PACE clusters ready for research

Filed under: tech support — admin @ 4:09 am

Our April maintenance window is now complete.  As usual, we have a number of compute nodes that still need to be brought back online; however, we are substantially back online and processing jobs at this point.

We did run into an unanticipated maintenance item with the GPFS storage; no data has been lost.  As we added disks to the DDN storage system, we neglected to perform a required rebalancing operation that spreads load across all of the disks.  The rebalancing operation has been running for the majority of our maintenance window, but the task is large and progress has been much slower than expected.  We will continue to perform the rebalancing during off-peak times to mitigate the impact on storage performance as best we are able.

Removal of /nv/gpfs-gateway-* mount points

Task complete as described.  The system should no longer generate these paths.  If your scripts use these paths explicitly, your jobs will likely fail.  Please use paths relative to your home directory (e.g., ~/data, ~/scratch) for future compatibility.
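
For example, in a job script, refer to storage through your home directory rather than through an explicit gateway path (the directory and file names below are just placeholders):

cd ~/data/my_project        # instead of /nv/gpfs-gateway-pace1/...
./my_program > ~/scratch/my_output.log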

New GPFS gateway

Task complete as described.

GPFS server and client tuning

Task complete as described.

Decommission old Panasas scratch

Task complete as described.  Paths starting with /panfs no longer work.  Everybody should have been transitioned to the new scratch long ago, so we do not expect anybody to have issues here.

Enabling debug mode

Task complete as described.  You may see additional warning messages if your code is not well behaved with regard to memory utilization.  This is a hint that you may have a bug.

Removal of compatibility links for migrated storage 

Task complete as described.  Affected users (Prometheus and CEE clusters) were contacted before maintenance day.  No user impact is expected, but please send in a ticket if you think there is a problem.

Scheduler updates

Task complete as described.

Networking Improvements

Task complete as described.

Diskless node transition

Task complete as described.

Security updates

Task complete as described.

April 19, 2016

UNDERWAY: PACE quarterly maintenance – April ’16

Filed under: tech support — admin @ 10:05 am

Quarterly maintenance is now underway.  All clusters managed by PACE, including Tardis, are now offline.  Please see our previous post for details.

April 13, 2016

PACE quarterly maintenance – April ’16

Filed under: Uncategorized — Semir Sarajlic @ 10:10 pm

Greetings!

The PACE team is once again preparing for maintenance activities that will start at 6:00am on Tuesday, April 19 and continue through Wednesday, April 20.  We are planning several improvements that we hope will provide a much better PACE experience.

GPFS storage improvements

Removal of all /nv/gpfs-gateway-* mount points (user action recommended): In the past, we noticed performance and reliability problems with mounting GPFS natively on machines with slow network connections (including most headnodes, some compute nodes, and some system servers). To address this problem, we deployed a physical ‘gateway’ machine that mounts GPFS natively and serves its content via NFS to machines with slow network connections (see https://blog.pace.gatech.edu/?p=5842).

We have been mounting this gateway on *all* of the machines using these locations:

/nv/gpfs-gateway-pace1
/nv/gpfs-gateway-scratch1
/nv/gpfs-gateway-menon1

Unfortunately, these mount points caused some problems in the longer run, especially because a system variable (PBS_O_WORKDIR) was being assigned these locations as the “working directory” for jobs, even on machines with fast network connections. As a result, a large fraction of data operations went through the gateway server instead of the GPFS servers, causing significant slowness.

We partially addressed this problem by fixing the root cause of the unintended PBS_O_WORKDIR assignment, and also through user communication and education.

On this maintenance day, we are getting rid of these mount points completely. Instead, GPFS will always be mounted on:

/gpfs/pace1
/gpfs/scratch1
/gpfs/menon1

These paths will work regardless of how a particular node mounts GPFS (natively or via the gateways).

User action: Please check your scripts to ensure that the old locations are not being used. Jobs that reference these locations will fail after the maintenance day (including jobs that have already been submitted).
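
One quick way to check is to search your job scripts for the old mount points; for example (the ~/my_job_scripts directory below is just a placeholder for wherever your scripts live):

grep -rn '/nv/gpfs-gateway-' ~/my_job_scripts
# any matching line should be updated to the corresponding /gpfs/... path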

A new GPFS gateway (no user action required): We increasingly rely on the GPFS filesystem for multiple storage needs, including scratch, the majority of project directories, and some home directories.  While the gateway provided some benefits, some users continued to report unresponsive or slow commands on headnodes due to a combination of high activity levels and limited NFS performance.

During this maintenance, we are planning to deploy a second gateway server to separate headnodes from other functions (compute nodes and backup processes). This will improve the responsiveness of headnodes and provide better interactivity; in other words, you will see much less slowness when running system commands such as “ls”.

GPFS server and client tuning (no user action required): We identified several configuration tuning parameters to improve the performance and reliability of GPFS, based on vendor recommendations and our own analysis. We are planning to apply these configuration changes on this maintenance day as a fine-tuning step.

Decommissioning old Panasas scratch (no user action required)

When we made the switch to the new scratch space (GPFS) during the January maintenance, we kept the old (Panasas) system accessible as read-only. Some users received a link to their old data if their migration had not completed within the maintenance window. We are finally ready to pull the plug on this Panasas system. You should have no dependencies on this system anymore, but please contact PACE support as soon as possible if you have any concerns or questions regarding its decommissioning.

Enabling debug mode (limited user visibility)

RHEL6, which has been used on all PACE systems for a long while, optionally comes with an implementation of the memory-allocation functions that performs additional heap error/consistency checks at runtime. We have had this functionality installed, but memory errors have been silently ignored per our configuration, which is not ideal. We are planning to change the configuration to print diagnostics on stderr when an error is detected. Please note that you should not see any differences in the way your codes run; this only changes how memory errors are reported.  This behavior is controlled by the MALLOC_CHECK_ environment variable. A simple example is when a dynamically allocated array is freed twice (e.g. using ‘free’ in C). Here’s a demo of the different behaviors for three values of MALLOC_CHECK_ when an array is freed twice:

MALLOC_CHECK_=0
(no output)


MALLOC_CHECK_=1

*** glibc detected *** ./malloc_check: free(): invalid pointer: 0x0000000000601010 ***

MALLOC_CHECK_=2

Aborted (core dumped)
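
For reference, here is a minimal way to reproduce the behavior above (a sketch; this is not necessarily the exact test program used to produce the outputs shown):

cat > malloc_check.c <<'EOF'
#include <stdlib.h>

int main(void)
{
    int *a = malloc(16 * sizeof(int));  /* dynamically allocated array */
    free(a);
    free(a);                            /* freed twice: a heap error   */
    return 0;
}
EOF
gcc -o malloc_check malloc_check.c

MALLOC_CHECK_=0 ./malloc_check    # errors silently ignored (current default)
MALLOC_CHECK_=1 ./malloc_check    # diagnostic printed on stderr (new default)
MALLOC_CHECK_=2 ./malloc_check    # program aborts (core dumped)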

We currently have this value set to “0” and will make “1” the new default so that a description of the error(s) is printed. If this change causes any problems for you, or you simply don’t want any changes in your environment, you can set this variable back to “0” in your ~/.bashrc to override the new default.
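
For example, adding the following line to your ~/.bashrc restores the current (silent) behavior:

export MALLOC_CHECK_=0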

Removal of compatibility links for migrated storage (some user action may be required)

We had migrated some of the NFS project storage (namely pcee1 and pme[1-8]) to GPFS in the past. When we did that, we placed links in the old storage locations (starting with /nv/…) that point to the new GPFS locations (starting with /gpfs/pace1/project/…) to protect active jobs from crashing. This was only a temporary measure to facilitate the transition.

As a part of this maintenance day, we are planning to remove these links completely. We have already contacted all of the users whose projects are on these locations and confirmed that their ~/data links were updated accordingly, so we expect no user impact. That said, if you are one of these users, please make sure that none of your scripts reference the old locations mentioned in our email.
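
As with the gateway paths above, a quick search of your job scripts can confirm that nothing still points at the old locations; for example (the script directory is a placeholder, and the pattern assumes the old paths have the form /nv/pcee1 and /nv/pme1 through /nv/pme8):

grep -rnE '/nv/(pcee1|pme[1-8])' ~/my_job_scripts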

Scheduler updates (no user action required)

A patched version of the resource manager (Torque) was deployed on the scheduler servers shortly after the January maintenance day. The patch addresses a bug in the administration functions only. While it is not critical for compute nodes, we will go ahead and update all compute nodes to bring their version on par with the scheduler for consistency. This update will not cause any visible differences for users.

Networking Improvements (no user action required)

Spring is here and it’s time for some cleanup. We will get rid of unused cables in the datacenter and remove some unused switches from the racks. We are also planning some recabling to take better advantage of existing switches and improve redundancy. We will continue to test and enable jumbo frames (where possible) to lower networking overhead. None of these tasks requires user action.

Diskless node transition (no user action required)

We will continue the transition away from diskless nodes that we started in October 2015.  This mainly affects nodes in the 5-to-6-year-old range.  Apart from more predictable performance on these nodes, this should be a transparent change.

Security updates (no user action required)

We are also planning to update some system packages and libraries to address known security vulnerabilities and bugs. There should be no user impact.
