PACE A Partnership for an Advanced Computing Environment

March 24, 2018

[RESOLVED] Major power failure at PACE datacenter, jobs are impacted

Filed under: Uncategorized — Semir Sarajlic @ 3:11 pm

Update (3/26, 12:15pm): At this point, most nodes are back online, except for the nodes located on the P-row. To see if your cluster is on the P-row, you can run ‘pace-check-queue <queue_name>’ and look for nodes named either “rich133-p*” or “iw-p*” in the list. Gryphon and Uranus are two large clusters that are impacted, and many smaller clusters also have nodes on this row. We are actively working to bring these nodes back online as soon as possible.
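If the node list is long, a small shell filter can pick out the P-row names for you. The following is only a sketch: `p_row_nodes` is a hypothetical helper (not a PACE tool), and it assumes nothing about the output of pace-check-queue beyond node names appearing somewhere in the text. It simply extracts anything that starts with “rich133-p” or “iw-p”.

```shell
# Hypothetical helper, not part of the PACE tools.
# Reads any text on stdin and prints the unique node names that
# start with "rich133-p" or "iw-p" (the P-row prefixes named above).
p_row_nodes() {
  grep -oE '(rich133-p|iw-p)[A-Za-z0-9-]*' | sort -u
}

# Example usage (queue name taken from the affected-queue list):
#   pace-check-queue gryphon | p_row_nodes
```

If the filter prints nothing, no P-row node names were found in the output for that queue.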

Update (3/24, 6:15pm): We have powered on the majority of compute nodes, which have started running jobs again. We’ll continue to bring more nodes online over the next week. Please contact pace-support@oit.gatech.edu if you are seeing continued job crashes or nodes that are not mounting storage.

Update (3/24, 11:22am): We have identified affected queues as follows (not a complete list):

apurimac-bg-6,aryabhata-6,ase1-debug-6,atlas-6,complexity,datamover,davenporter,
epictetus,granulous,jabberwocky-6,kennedy-lab,martini,megatron,monkeys_gpu,monkeys,
mps,njord-6,semap-6,skadi,uranus-6,breakfix,gryphon-debug,gryphon-ivy,gryphon-prio,
gryphon,gryphon-test,gryphon-tmp,roc,apurimacforce-6,b5force-6,biobot,biocluster-6,
bioforce-6,biohimem-6,ceeforce,chemprot,chemxforce,cns-6-intel,cnsforce-6,
critcelforce-6,critcel-prv,critcel,cygnusforce-6,cygnus,dimerforce-6,eceforce-6,
enveomics-6,faceoffforce-6,faceoff,flamelforce,force-6,force-gpu,habanero,hummus,
hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joeforce,
kastellaforce-6,mathforce-6,mday-test,microcluster,micro-largedata,optimusforce-6,
optimus,prometforce-6,prometheus,rombergforce,sonarforce-6,spartacusfrc-6,spartacus,
threshold,try-6

 

Original Post:

What’s happening?

PACE’s Rich datacenter suffered a major power failure at around 8:30am this morning, impacting roughly half of the compute nodes. Storage systems are not affected and your data are safe, but all of the jobs running on affected nodes have been killed. Please see below for a list of all impacted queues.

Current Situation:
The OIT Operations team has restored power, and PACE is bringing nodes back online as quickly as possible. This is a sequential process, and it may take several hours to bring all of the nodes online.

What user action is needed?
Please check your jobs to see which ones have crashed and re-submit them as needed. We are still working on bringing nodes back online, but it’s safe to submit jobs now. Submitted jobs will wait in the queue and start running once the nodes are available again.

Please follow our updates on the pace-availability email list and at blog.pace.gatech.edu.

Thank you,
PACE Team
