
Author Archive

PACE is experiencing storage (GPFS) problems

Posted by on Friday, 16 June, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts most of the project directories.

We are working with the vendor to investigate the ongoing issues. At this moment, we don't know whether they are related to yesterday's power/cooling failure, but we will update the PACE community as we find out more.

This issue may impact running jobs, and we are sorry for the inconvenience.

PACE datacenter experienced a power/cooling failure

Posted by on Friday, 16 June, 2017
What happened: We had a brief power failure in our datacenter, which took out cooling in racks running chilled water. This impacted about 160 nodes from various queues, with potential impact on running jobs.
Current situation: Some cooling has been restored; however, we had to shut down several of the hottest racks that were not cooling down (p41, k30, h43, c29, c42). We are keeping a close eye on the remaining at-risk racks in coordination with the Operations team, which continues to monitor temperatures in these racks.
We will start bringing the down nodes online once the cooling issue is fully resolved.
What can you do: If you were using any of the queues listed below, please resubmit any failed jobs. As always, contact pace-support@oit.gatech.edu for any assistance you may need.
Thank you for your patience, and we are sorry for the inconvenience.

Impacted Queues:

—————————
apurimac-6
apurimacforce-6
atlas-6
atlas-debug
b5force-6
biobot
biobotforce-6
bioforce-6
breakfix
cee
ceeforce
chemprot
chowforce-6
cnsforce-6
critcel
critcel-burnup
critcelforce-6
critcel-prv
cygnus
cygnus-6
cygnus64-6
cygnusforce-6
cygnus-hp
davenprtforce-6
dimerforce-6
ece
eceforce-6
enveomics-6
faceoff
faceoffforce-6
force-6
ggate-6
granulous
gryphon
gryphon-debug
gryphon-prio
gryphon-tmp
hygeneforce-6
isabella-prv
isblforce-6
iw-shared-6
martini
mathforce-6
mayorlab_force-6
mday-test
medprint-6
medprintfrc-6
megatron
megatronforce-6
microcluster
micro-largedata
monkeys_gpu
mps
njordforce-6
optimusforce-6
prometforce-6
prometheus
radiance
rombergforce
semap-6
skadi
sonarforce-6
spartacusfrc-6
threshold-6
try-6
uranus-6

Infiniband switch failure causing partial network and storage unavailability

Posted by on Thursday, 25 May, 2017
We experienced an InfiniBand (IB) switch failure, which impacted several racks of nodes connected to this switch. The failure caused MPI job crashes and GPFS unavailability.

The switch is now back online and it’s safe to submit new jobs.

If you are using one or more of the queues listed below, please check your jobs and re-submit them if necessary. One indication of this issue is “Stale file handle” error messages in the job output or logs.

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar

PACE quarterly maintenance – May 11, 2017

Posted by on Monday, 8 May, 2017

PACE clusters and systems will be taken offline at 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Systems

  • We will deploy a recompiled kernel that’s identical to the current version except for a patch that addresses the dirty cow vulnerability. Currently, we have a mitigation in place that prevents the use of debuggers and profilers (e.g. gdb, strace, Allinea DDT). After the deployment of the patched kernel, these tools will once again be available on all nodes. Please let us know if you continue to have problems debugging or profiling your codes after the maintenance day.

Storage

  • Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)

Network

  • Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
  • Software upgrades to the PACE firewall appliance to address a known bug
  • New subnets and re-assignment of IP addresses for some of the clusters

Power

  • PDU fixes affecting 3 nodes in the c29 rack

The date for the next maintenance day is not certain yet, but we will announce it as soon as we have it.

Please test the new patched kernel on TestFlight nodes

Posted by on Wednesday, 1 March, 2017

As some of you are already aware, the dirty cow exploit was a source of great concern for PACE. This exploit can allow a local user to gain elevated privileges. For more details, please see “https://access.redhat.com/blogs/766093/posts/2757141”.

In response, PACE applied a mitigation on all of the nodes. While this mitigation is effective in protecting the systems, it has the downside of causing debugging tools (e.g. strace, gdb and DDT) to stop working. Unfortunately, none of the new (and patched) kernel versions made available by Red Hat supports our InfiniBand network drivers (OFED), so we had to leave the mitigation in place for a while. This caused inconvenience, particularly for users who are actively developing codes and rely on these debuggers.

As a long-term solution, we patched the kernel source code and recompiled it, without changing anything else. Our initial tests were successful, so we deployed it on three of the four online nodes in the testflight queue:

rich133-k43-34-l recompiled kernel
rich133-k43-34-r recompiled kernel
rich133-k43-35-l original kernel
rich133-k43-35-r recompiled kernel

We would like to ask you to please test your codes on this queue. Our plan is to deploy this recompiled kernel to all of the PACE nodes, including headnodes and compute nodes. We would like to make sure that your codes will continue to run after this deployment without any difference.

The deployment will be a rolling update; that is, we will opportunistically patch nodes, starting with idle ones. Until the deployment is complete, the same queues will contain a mix of nodes with old and recompiled kernels. For this reason, we strongly recommend testing multi-node parallel applications with the node that still runs the original kernel (rich133-k43-35-l) included in the hostlist, to verify your code’s behavior with mixed hostlists.

As always, please keep your testflight runs short to allow other users to test their own codes. Please report any problems to pace-support@oit.gatech.edu and we will be happy to help. Hopefully, this deployment will be completely transparent to most users, if not all.

Power maintenance 12/19/2016 (Monday)

Posted by on Friday, 16 December, 2016

(No user action needed)

We have been informed that GT Facilities will perform critical power maintenance in one of the PACE datacenters beginning at 6am on Monday, 12/19/2016.

After a careful investigation, we believe PACE systems have sufficient power redundancy to allow this work to be completed without downtime or failure. However, there is always a small risk that some jobs or services will be impacted. We will work closely with the OIT operations and facilities teams to help protect running jobs from failures, and we will keep all PACE users informed of progress or of any failures that occur.

PACE scratch storage is now larger and faster

Posted by on Wednesday, 2 November, 2016

We have increased the capacity of the scratch storage from ~350TB to ~522TB this week, matching the capacity of the old server (Panasas) that was decommissioned back in April. The additional drives were installed without any downtime, with no impact on jobs.

This also means a larger number of drives contributing to parallel reads and writes, potentially increasing the overall performance of the filesystem.

No user action is needed, and you should not see any differences in the way you use the scratch space.

Some PACE accounts temporarily disabled

Posted by on Friday, 22 July, 2016

Some of our users will not be able to log in to PACE clusters after the maintenance day due to the ongoing scratch data migration. Here are the details:

One of the tasks performed during this maintenance day (http://blog.pace.gatech.edu/?p=5943) was to deploy a new fileserver to serve the “scratch” storage.

The new system is now operational; however, the scratch data for some accounts is still being migrated. We have temporarily deactivated these accounts to prevent modifications to incomplete data sets, and jobs submitted by these users are being held by the scheduler. Once the transfer is complete for a user, we will re-enable the account, release the user’s jobs, and send a notification email. We currently have no way to estimate how long the migration will take, but we are doing all we can to make this process go as quickly as possible.

If you are one of these users and need any of your data urgently, please contact us at pace-support@oit.gatech.edu. All of your data is intact and accessible to us, and we will try to find a solution for delivering the data you need for your research.

PACE quarterly maintenance – April ’16

Posted by on Wednesday, 13 April, 2016

Greetings!

The PACE team is once again preparing for maintenance activities that will start at 6:00am on Tuesday, April 19 and continue through Wednesday, April 20. We are planning several improvements that we hope will provide a much better PACE experience.

GPFS storage improvements

Removal of all /nv/gpfs-gateway-* mount points (user action recommended): In the past, we noticed performance and reliability problems with mounting GPFS natively on machines with slow network connections (including most headnodes, some compute nodes, and some system servers). To address this problem, we deployed a physical ‘gateway’ machine that mounts GPFS natively and serves its content via NFS to machines with slow network connections (see http://blog.pace.gatech.edu/?p=5842).

We have been mounting this gateway on *all* of the machines using these locations:

/nv/gpfs-gateway-pace1
/nv/gpfs-gateway-scratch1
/nv/gpfs-gateway-menon1

Unfortunately, these mount points caused problems in the longer run, especially because a system variable (PBS_O_WORKDIR) was assigned these locations as the “working directory” for jobs, even on machines with fast network connections. As a result, a large fraction of data operations went through the gateway server instead of the GPFS server, causing significant slowness.

We partially addressed this problem by fixing the root cause of the unintended PBS_O_WORKDIR assignment, and also through user communication and education.

On this maintenance day, we are getting rid of these mount points completely. Instead, GPFS will always be mounted at:

/gpfs/pace1
/gpfs/scratch1
/gpfs/menon1

This is the case regardless of how a particular node mounts GPFS (natively or via the gateways).

User action: Please check your scripts to ensure that the old locations are not being used. Jobs that try to use these locations, including those that have already been submitted, will fail after the maintenance day.

A new GPFS gateway (no user action required): We increasingly rely on the GPFS filesystem for multiple storage needs, including scratch, the majority of project directories, and some home directories. While the gateway provided some benefits, some users continued to report unresponsive or slow commands on headnodes due to a combination of high levels of activity and limited NFS performance.
On this maintenance day, we plan to deploy a second gateway server to separate headnodes from other functions (compute nodes and backup processes). This will improve the responsiveness of headnodes, giving our users better interactivity; in other words, you will see much less slowness when running system commands such as “ls”.

GPFS server and client tuning (no user action required): We identified several configuration tuning parameters to improve the performance and reliability of GPFS, in light of vendor recommendations and our own analysis. We are planning to apply these configuration changes on this maintenance day as a fine-tuning step.

Decommissioning old Panasas scratch (no user action required)

When we made the switch to the new scratch space (GPFS) during the January maintenance, we kept the old (Panasas) system accessible as read-only. Some users received a link to their old data if their migration had not completed within the maintenance window. We are finally ready to pull the plug on this Panasas system. You should have no dependencies on it anymore, but please contact PACE support as soon as possible if you have any concerns or questions regarding the decommissioning of this system.

Enabling debug mode (limited user visibility)

RHEL6, which has been used on all PACE systems for a long while, optionally comes with an implementation of the memory-allocation functions that performs additional heap error/consistency checks at runtime. We have had this functionality installed, but memory errors have been silently ignored per our configuration, which is not ideal. We are planning to change the configuration to print diagnostics on stderr when an error is detected. Please note that you should not see any differences in the way your codes run; this only changes how memory errors are reported. This behavior is controlled by the MALLOC_CHECK_ environment variable. A simple example is when a dynamically allocated array is freed twice (e.g. using the ‘free’ statement in C). Here’s a demo of the different behaviors for three different values of MALLOC_CHECK_ when an array is freed twice:
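(For reference, a minimal C program along the following lines is enough to trigger the double free; this is a sketch, and the exact test we used may differ.)

/* malloc_check.c -- free the same heap pointer twice.
   Build:  gcc -o malloc_check malloc_check.c
   Run:    MALLOC_CHECK_=1 ./malloc_check              */
#include <stdlib.h>

int main(void)
{
    int *array = malloc(10 * sizeof(int));  /* dynamically allocated array */
    free(array);                            /* first free: fine */
    free(array);                            /* second free: heap error */
    return 0;
}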

MALLOC_CHECK_=0
(no output)

MALLOC_CHECK_=1
*** glibc detected *** ./malloc_check: free(): invalid pointer: 0x0000000000601010 ***

MALLOC_CHECK_=2
Aborted (core dumped)

We currently have this value set to “0” and will make “1” the new default, so that a description of the error(s) is printed. If this change causes any problems for you, or you simply don’t want any changes in your environment, you can assign “0” to this variable in your “~/.bashrc” to override the new default.

Removal of compatibility links for migrated storage (some user action may be required)

In the past, we migrated some of the NFS project storage (namely pcee1 and pme[1-8]) to GPFS. When we did that, we placed links in the old storage locations (starting with /nv/…) that point to the new GPFS locations (starting with /gpfs/pace1/project/…) to protect active jobs from crashing. This was only a temporary measure to facilitate the transition.

As part of this maintenance day, we are planning to remove these links completely. We have already contacted all of the users whose projects are in these locations and confirmed that their ~/data links are updated accordingly, so we expect no user impact. That said, if you are one of these users, please make sure that none of your scripts reference the old locations mentioned in our email.

Scheduler updates (no user action required)

A patched version of the resource manager (Torque) was deployed on the scheduler servers shortly after the January maintenance day. The patch addresses a bug in the administration functions only. While it is not critical for compute nodes, we will go ahead and update all compute nodes to bring their version in line with the scheduler’s for consistency. This update will not cause any visible differences for users, and no user action is required.

Networking Improvements (no user action required)

Spring is here and it’s time for some cleanup. We will get rid of unused cables in the datacenter and remove some unused switches from the racks. We are also planning some recabling to take better advantage of existing switches and improve redundancy. We will continue to test and enable jumbo frames (where possible) to lower networking overhead. None of these tasks requires user action.

Diskless node transition (no user action required)

We will continue the transition away from diskless nodes that we started in October 2015. This mainly affects nodes in the 5-6 year old range. Apart from giving these nodes more predictable performance, this should be a transparent change.

Security updates (no user action required)

We are also planning to update some system packages and libraries to address known security vulnerabilities and bugs. There should be no user impact.

Free XSEDE/NCSI Summer Workshops

Posted by on Friday, 25 March, 2016

SUMMARY:

*FREE* REGISTRATION IS OPEN!
XSEDE/National Computational Science Institute Workshops
Summer 2016

(1) Computing MATTERS: Inquiry-Based Science and Mathematics
Enhanced by Computational Thinking

(1a) May 16-18 2016, Oklahoma State U, Stillwater OK
(1b) July 18-20 2016, Boise State U, Boise ID
(1c) Aug 1-3 2016, West Virginia State U, Institute WV

(2) LittleFe Curriculum Module Buildout
June 20-22 2016, Shodor, Durham NC

Contact: Kate Cahill (kcahill@osc.edu)
http://computationalscience.org/workshops2016

DETAILS:

The XSEDE project is pleased to announce the opening of
registrations for faculty computational science education
workshops for 2016.

There are no fees for participating in the workshops.

The workshops also cover local accommodations and food during the
workshop hours for those outside of commuting distance to the host
sites.

This year there are three workshops at various locations focused
on Inquiry-Based Science and Mathematics Enhanced by Computational
Thinking and one workshop on the LittleFe Curriculum Module
Buildout.

The computational thinking workshops are hosted at Oklahoma State
University on May 16-18, 2016, at Boise State University on
July 18-20, 2016, and at West Virginia State University on
August 1-3, 2016.

The LittleFe curriculum workshop will be held on June 20-22 at
Shodor Education Foundation.

To register for the workshop, go to

http://computationalscience.org/workshops2016

and begin the registration process by clicking on the Register
through XSEDE button for the relevant workshop.

Participants will be asked to create an XSEDE portal account
if they do not yet have one.

Following that registration, participants will be directed back to

http://computationalscience.org/workshops2016

to provide additional information on their background and travel
plans.

A limited number of travel scholarships may also be available as
reimbursements for receipted travel to more distant faculty
interested in attending the workshops.

The scholarships will provide partial or full reimbursement of
travel costs to and from the workshops.

Preference will be given to faculty from institutions that are
formally engaged with the XSEDE education program and to those
who can provide some matching travel funds.

Recipients are expected to be present for the full workshop.

The travel scholarship application is available via a link at

http://computationalscience.org/workshops2016

For questions about the summer workshops please contact:

Kate Cahill (kcahill@osc.edu)