
Author Archive

[RESOLVED] Post-maintenance Storage Issues

Posted by on Saturday, 12 August, 2017

Update: GPFS storage has stabilized again. Several steps remain to complete this work, all of which can be done without downtime. We may need to take some nodes temporarily offline; we will coordinate this with you so that running jobs are not impacted.
—–

PACE systems started experiencing widespread problems with the GPFS storage shortly after we released jobs at the end of the maintenance tasks. At first glance, the problems seem to be related to the InfiniBand network.

We would like to advise against submitting new jobs until these issues are fully resolved. We will continue to work on a resolution and keep you updated on the progress.

Thank you for your patience.

PACE systems experiencing problems

Posted by on Wednesday, 9 August, 2017
Update: The immediate issue is now resolved. We continue to work with the vendor to find out more about what caused the problem. If you need assistance with failed jobs or unavailable headnodes, please contact pace-support@oit.gatech.edu.
We are experiencing a storage issue that’s impacting the majority of the home directories and some services, with potential impact on running jobs. We are currently in the process of assessing the overall impact and investigating the root cause. We’ll update our blog (http://blog.pace.gatech.edu/) as we learn more about the failures.
We are sorry for the inconvenience and thank you for your patience.
Update: Impacted filesystems:
ap1
ap2
ap3
ap4
ap5
ap6
hase1
hb51
hbiobot1
hcee2
hcfm1
hchpro1
hcoc1
hface1
hggate1
hhygene1
hjabberwocky1
hmart1
hmedprint1
hmeg1
hmicro1
hneutrons1
hnjord1
hp1
hp10
hp11
hp12
hp13
hp14
hp15
hp16
hp17
hp18
hp19
hp2
hp20
hp21
hp22
hp23
hp24
hp25
hp26
hp27
hp28
hp29
hp4
hp5
hp6
hp7
hp8
hp9
hpampa1
hsemap1
hska1
hsonar1
hthreshold1
html1

PACE quarterly maintenance – (Aug 10-12, 2017)

Posted by on Thursday, 3 August, 2017

Dear PACE users,

PACE clusters and systems will be taken offline at 6am on Thursday, Aug 10 through the end of Saturday (Aug 12). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.
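As a rough illustration of which jobs might be held, the sketch below checks whether a job's requested walltime would carry it past the start of the maintenance window. The rule itself (a job is held if it could still be running at 6am on Aug 10) is an assumption made for illustration; the scheduler's actual policy may differ. The function names and the HH:MM:SS walltime format are also assumptions.

```python
from datetime import datetime, timedelta

# Start of the maintenance window from the announcement: 6am on Thursday, Aug 10, 2017.
MAINTENANCE_START = datetime(2017, 8, 10, 6, 0)


def parse_walltime(walltime: str) -> timedelta:
    """Parse an HH:MM:SS walltime string (hours may exceed 24)."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def would_be_held(start: datetime, walltime: str) -> bool:
    """Assumed rule: hold the job if it could still be running when maintenance starts."""
    return start + parse_walltime(walltime) > MAINTENANCE_START


if __name__ == "__main__":
    submit_time = datetime(2017, 8, 9, 12, 0)  # noon on the day before maintenance
    for requested in ("12:00:00", "24:00:00"):
        print(requested, "held:", would_be_held(submit_time, requested))
```

Under this assumed rule, a job that could start at noon on Aug 9 fits before the window with a 12-hour walltime but not with a 24-hour one.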

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Storage (no user action needed)

We are working with the vendor to reconfigure our multiple GPFS filesystems for performance fine-tuning and better support for large scale deployments. Some of these configurations will be applied on the maintenance day because they require downtime.

Network (no user action needed)

The PACE InfiniBand (IB) network requires some configuration changes to handle the rapid growth in the number of PACE nodes. We have identified several configuration parameters that will potentially reduce how often nodes lose their connectivity to GPFS (which relies on the IB network), a problem that has been causing intermittent job crashes.

Schedulers (postponed, no user action needed)

We have communicated our plans to upgrade the scheduler on several occasions, but skipped this task on past maintenance days because of bugs that we had uncovered. Despite the vendor's promising progress on these bugs, they are not fully resolved and tested yet. For this reason, we decided to once again postpone the upgrade and keep the current versions until we have a bug-free and well-tested version.

Scheduler-based monitoring and analysis (no user action needed)

PACE research scientists started a collaboration with the Texas Advanced Computing Center (TACC) to develop a new tool that analyzes scheduler logs to gain insights about usage trends. For more details, please check https://doi.org/10.1145/3093338.3093351

This tool relies heavily on a widely used utility named ‘PBSTools’, which is developed by the Ohio Supercomputer Center. Our current installation of PBSTools is old, buggy and very slow. We will upgrade this tool and its database on the maintenance day to ensure that no job info is lost during the transition.

Power Work (no user action needed)

PACE and the OIT Operations team will perform some work in the datacenter to improve electrical capacity and balance. This includes moving and/or replacing some of the in-rack Power Distribution Units (PDUs).

Software Repository (no user action needed)

Our NIST SP800-171 conforming secure cluster has a specific need for a dedicated copy of the PACE software repository. We have previously completed an initial replication of the files and will simply point the nodes to the replica during the maintenance window.

PACE is experiencing storage (GPFS) problems

Posted by on Friday, 16 June, 2017

We are experiencing intermittent problems with the GPFS storage system that hosts most of the project directories.

We are working with the vendor to investigate the ongoing issues. At this moment we don’t know whether they are related to yesterday’s power/cooling failures or not, but we will update the PACE community as we find out more.

This issue has potential impact on running jobs and we are sorry for this inconvenience.

PACE datacenter experienced a power/cooling failure

Posted by on Friday, 16 June, 2017
What happened: We had a brief power failure in our datacenter, which took out cooling in racks running chilled water. This impacted about 160 nodes from various queues, with potential impact on running jobs.
Current Situation: Some cooling has been restored; however, we had to shut down several of the hottest racks that were not cooling down (p41, k30, h43, c29, c42). In coordination with the Operations team, we are keeping a close eye on the remaining racks in the risk area as they continue to monitor temperatures in these racks.
We will start bringing the down nodes online once the cooling issue is fully resolved.
What you can do: Please resubmit failed jobs (if any) if you were using any of the queues listed below. As always, contact pace-support@oit.gatech.edu for any kind of assistance you may need.
Thank you for your patience and sorry for the inconvenience.

Impacted Queues:

—————————
apurimac-6
apurimacforce-6
atlas-6
atlas-debug
b5force-6
biobot
biobotforce-6
bioforce-6
breakfix
cee
ceeforce
chemprot
chowforce-6
cnsforce-6
critcel
critcel-burnup
critcelforce-6
critcel-prv
cygnus
cygnus-6
cygnus64-6
cygnusforce-6
cygnus-hp
davenprtforce-6
dimerforce-6
ece
eceforce-6
enveomics-6
faceoff
faceoffforce-6
force-6
ggate-6
granulous
gryphon
gryphon-debug
gryphon-prio
gryphon-tmp
hygeneforce-6
isabella-prv
isblforce-6
iw-shared-6
martini
mathforce-6
mayorlab_force-6
mday-test
medprint-6
medprintfrc-6
megatron
megatronforce-6
microcluster
micro-largedata
monkeys_gpu
mps
njordforce-6
optimusforce-6
prometforce-6
prometheus
radiance
rombergforce
semap-6
skadi
sonarforce-6
spartacusfrc-6
threshold-6
try-6
uranus-6

InfiniBand switch failure causing partial network and storage unavailability

Posted by on Thursday, 25 May, 2017
We experienced an InfiniBand (IB) switch failure, which impacted several racks of nodes connected to this switch. This issue caused MPI job crashes and GPFS unavailability.

The switch is now back online and it’s safe to submit new jobs.

If you are using one or more of the queues (listed below), please check your jobs and re-submit them if necessary. One indication of this issue is “Stale file handle” error messages that may appear in the job output or logs.

Impacted Queues:
=============
athena-intel
atlantis
atlas-6-sunge
atlas-intel
joe-6-intel
test85
apurimacforce-6
b5force-6
bioforce-6
ceeforce
chemprot
cnsforce-6
critcelforce-6
cygnusforce-6
dimerforce-6
eceforce-6
faceoffforce-6
force-6
hygeneforce-6
isblforce-6
iw-shared-6
mathforce-6
mayorlab_force-6
medprint-6
nvidia-gpu
optimusforce-6
prometforce-6
rombergforce
sonarforce-6
spartacusfrc-6
try-6
testflight
novazohar
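
One way to check many job outputs at once for the “Stale file handle” message mentioned above is sketched below. This is a minimal example and not a PACE-provided tool: the function names are hypothetical, and the default `*.o*` filename pattern is an assumption about how your job output files are named, so adjust it to match your own files.

```python
from pathlib import Path
from typing import Dict, List

# Error text quoted in the announcement as an indication of this issue.
STALE_MARKER = "Stale file handle"


def find_stale_handle_lines(text: str) -> List[int]:
    """Return the 1-based numbers of the lines in `text` that contain the marker."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if STALE_MARKER in line
    ]


def scan_job_outputs(directory: str, pattern: str = "*.o*") -> Dict[str, List[int]]:
    """Scan files matching `pattern` in `directory`; map each affected file to its line numbers."""
    hits: Dict[str, List[int]] = {}
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains undecodable bytes.
        lines = find_stale_handle_lines(path.read_text(errors="replace"))
        if lines:
            hits[str(path)] = lines
    return hits


if __name__ == "__main__":
    for filename, line_numbers in scan_job_outputs(".").items():
        print(f"{filename}: marker found on lines {line_numbers}")
```

Any file reported by the scan belongs to a job that is a candidate for resubmission.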

PACE quarterly maintenance – May 11, 2017

Posted by on Monday, 8 May, 2017

PACE clusters and systems will be taken offline at 6am this Thursday (May 11) through the end of Saturday (May 13). Jobs with long walltimes will be held by the scheduler to prevent them from getting killed when we power off the nodes. These jobs will be released as soon as the maintenance activities are complete.

Planned improvements are mostly transparent to users, requiring no user action before or after the maintenance.

Systems

  • We will deploy a recompiled kernel that’s identical to the current version except for a patch that addresses the Dirty COW vulnerability. Currently, we have a mitigation in place that prevents the use of debuggers and profilers (e.g. gdb, strace, Allinea DDT, etc). After the deployment of the patched kernel, these functions will once again be available on all nodes. Please let us know if you continue to have problems debugging or profiling your codes after the maintenance day.

Storage

  • Firmware updates on all of the DDN GPFS storage (scratch and most of the project storage)

Network

  • Upgrades to DNS servers, as recommended and performed by OIT Network Engineering
  • Software upgrades to the PACE firewall appliance to address a known bug
  • New subnets and re-assignment of IP addresses for some of the clusters

Power

  • Fixes for PDU problems that are impacting 3 nodes in the c29 rack

The date for the next maintenance day is not certain yet, but we will announce it as soon as we have it.

Please test the new patched kernel on TestFlight nodes

Posted by on Wednesday, 1 March, 2017

As some of you are already aware, the Dirty COW exploit was a source of great concern for PACE. This exploit can allow a local user to gain elevated privileges. For more details, please see “https://access.redhat.com/blogs/766093/posts/2757141”.

In response, PACE has applied a mitigation on all of the nodes. While this mitigation is effective in protecting the systems, it has the downside of causing debugging tools (e.g. strace, gdb and DDT) to stop working. Unfortunately, none of the new (and patched) kernel versions made available by Red Hat supports our InfiniBand network drivers (OFED), so we had to leave the mitigation running for a while. This caused inconvenience, particularly for users who are actively developing codes and relying on these debuggers.

As a long-term solution, we patched the source code of the kernel and recompiled it, without changing anything else. Our initial tests were successful, so we deployed it on three of the four online nodes in the testflight queue:

rich133-k43-34-l recompiled kernel
rich133-k43-34-r recompiled kernel
rich133-k43-35-l original kernel
rich133-k43-35-r recompiled kernel

We would like to ask you to please test your codes on this queue. Our plan is to deploy this recompiled kernel to all of the PACE nodes, including headnodes and compute nodes. We would like to make sure that your codes will continue to run after this deployment without any difference.

The deployment will be a rolling update; that is, we will opportunistically patch nodes, starting with the idle ones. As a result, the same queues will contain a mix of nodes with old and recompiled kernels until the deployment is complete. For this reason, we strongly recommend testing multi-node parallel applications with hostlists that include the node with the original kernel (rich133-k43-35-l), to check how your code behaves with mixed hostlists.
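
To confirm that a test job really ran on a mixed hostlist, a check along the following lines could be run at the start of the job. This is only a sketch: it assumes the scheduler exports the `PBS_NODEFILE` environment variable (as Torque/PBS normally does) pointing to a file with one hostname per line, and it treats every testflight node other than rich133-k43-35-l as running the recompiled kernel, per the list above. The function names are hypothetical.

```python
import os

# The one testflight node still running the original kernel, per the list above.
ORIGINAL_KERNEL_NODE = "rich133-k43-35-l"


def read_hostlist(nodefile_path):
    """Read a nodefile (one hostname per line) and return unique hosts in order."""
    with open(nodefile_path) as nodefile:
        hosts = [line.strip() for line in nodefile if line.strip()]
    return list(dict.fromkeys(hosts))  # drops repeats (one line per core) but keeps order


def is_mixed_hostlist(hosts):
    """True if the hostlist has the original-kernel node plus at least one other node."""
    short_names = [host.split(".")[0] for host in hosts]  # tolerate fully qualified names
    has_original = ORIGINAL_KERNEL_NODE in short_names
    has_other = any(name != ORIGINAL_KERNEL_NODE for name in short_names)
    return has_original and has_other


if __name__ == "__main__":
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile is None:
        print("PBS_NODEFILE is not set; run this inside a scheduled job.")
    else:
        hosts = read_hostlist(nodefile)
        print("hosts:", ", ".join(hosts))
        print("mixed kernels in hostlist:", is_mixed_hostlist(hosts))
```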

As always, please keep your testflight runs short to allow other users to test their own codes. Please report any problems to pace-support@oit.gatech.edu and we will be happy to help. Hopefully, this deployment will be completely transparent to most users, if not all.

Power maintenance 12/19/2016 (Monday)

Posted by on Friday, 16 December, 2016

(No user action needed)

We have been informed that GT Facilities will perform critical power maintenance in one of the PACE datacenters, beginning at 6am on Monday, 12/19/2016.

After a careful investigation, we believe PACE systems have sufficient power redundancy to allow this work to be completed without downtime or failures. However, there is always a small risk that some jobs or services will be impacted. We will work closely with the OIT Operations and Facilities teams to help protect running jobs from failures. We will keep all PACE users informed of progress and of any failures that occur.

PACE scratch storage is now larger and faster

Posted by on Wednesday, 2 November, 2016

We have increased the capacity of the scratch storage from ~350TB to ~522TB this week, matching the capacity of the old server (Panasas) that was decommissioned back in April. The additional drives were installed without any downtime, with no impact on jobs.

This also means a larger number of drives contributing to parallel reads and writes, potentially increasing the overall performance of the filesystem.

No user action needed, and you should not see any differences in the way you are using the scratch space.