PACE: A Partnership for an Advanced Computing Environment

March 28, 2018

[RESOLVED] PACE Storage Problems

Filed under: Uncategorized — Semir Sarajlic @ 8:03 pm

Update (3/29, 11:00am): We continued to see some problems overnight and this morning. It’s important to note that these back-to-back problems, namely the power loss, network issues, GPFS storage failures, and read-only headnodes, are separate events. Some of them may well be related, and the network is the most likely culprit. We are still investigating with the help of the storage and network teams.

The read-only headnodes are an unfortunate outcome of the VM storage failures. We have restored these systems and the VM storage and will start rebooting the headnodes shortly. We cannot be sure that these events will not recur, so expect frequent headnode reboots and denied logins while we recover these systems. Please be mindful of these possibilities and save your work frequently, or refrain from using the headnodes for anything other than submitting jobs.

The compute nodes appear to be mostly stable, although we have identified several with lingering storage issues.

Update (3/28, 11:30pm): Thanks to instant feedback from some of our users, we identified a list of headnodes that became read-only because of the storage issues. We have started rebooting them for filesystem checks. This process may take more than an hour to complete.

Update (3/28, 11:00pm): At this point, we have resolved the network issues, restored the storage systems, and brought the compute nodes back online; they have started running jobs.

We believe that the cascading issues were triggered by a network problem. We will continue to monitor the systems and will work with the vendor tomorrow to find out more.

Update (3/28, 9:30pm): All network and storage related issues have been addressed. We have started bringing nodes back online and are running tests to make sure they are healthy and can run jobs.

Original Post:

As several of you have already noticed and reported, PACE’s main storage systems are experiencing problems. The symptoms indicate a wide-scale network event, and we are working with the OIT Network Team to investigate.

This issue may impact running jobs, so please refrain from submitting new jobs until all systems and services are stabilized.

We don’t have an estimated time for resolution yet, but will continue to update this blog with the progress.

March 24, 2018

[RESOLVED] Major power failure at PACE datacenter, jobs are impacted

Filed under: Uncategorized — Semir Sarajlic @ 3:11 pm

Update (3/26, 12:15pm): At this point, most nodes are back online, except for the nodes located on the P-row. To see whether your cluster is on the P-row, run ‘pace-check-queue <queue_name>’ and look for nodes named either “rich133-p*” or “iw-p*” in the list. Gryphon and Uranus are two large clusters that are impacted, and there are many other smaller clusters with nodes on this row. We are actively working to bring these nodes back online ASAP.
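For example, a quick check from the command line might look like the following. This is only a sketch: it assumes pace-check-queue prints node hostnames in its output, and ‘gryphon’ is used purely as a placeholder queue name.

    pace-check-queue gryphon | grep -E 'rich133-p|iw-p'

If this prints any matching hostnames, the queue includes nodes on the P-row.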

Update (3/24, 6:15pm): We have powered on the majority of compute nodes, which have started running jobs again. We will continue to bring more nodes online during the next week. Please contact pace-support@oit.gatech.edu if you are seeing continued job crashes or nodes that are not mounting storage.

Update (3/24, 11:22am): We have identified affected queues as follows (not a complete list):

apurimac-bg-6,aryabhata-6,ase1-debug-6,atlas-6,complexity,datamover,davenporter,
epictetus,granulous,jabberwocky-6,kennedy-lab,martini,megatron,monkeys_gpu,monkeys,
mps,njord-6,semap-6,skadi,uranus-6,breakfix,gryphon-debug,gryphon-ivy,gryphon-prio,
gryphon,gryphon-test,gryphon-tmp,roc,apurimacforce-6,b5force-6,biobot,biocluster-6,
bioforce-6,biohimem-6,ceeforce,chemprot,chemxforce,cns-6-intel,cnsforce-6,
critcelforce-6,critcel-prv,critcel,cygnusforce-6,cygnus,dimerforce-6,eceforce-6,
enveomics-6,faceoffforce-6,faceoff,flamelforce,force-6,force-gpu,habanero,hummus,
hydraforce,hygene-6,hygeneforce-6,isabella-prv,isblforce-6,iw-shared-6,joeforce,
kastellaforce-6,mathforce-6,mday-test,microcluster,micro-largedata,optimusforce-6,
optimus,prometforce-6,prometheus,rombergforce,sonarforce-6,spartacusfrc-6,spartacus,
threshold,try-6

 

Original Post:

What’s happening?

PACE’s Rich datacenter suffered a major power failure at around 8:30am this morning, impacting roughly half of the compute nodes. Storage systems are not affected and your data are safe, but all of the jobs running on affected nodes have been killed. Please see below for a list of all impacted queues.

Current Situation:
The OIT Operations team has restored power, and PACE is bringing nodes back online as soon as possible. This is a sequential process, and it may take several hours to bring all of the nodes online.

What user action is needed?
Please check your jobs to see which ones have crashed and re-submit them as needed. We are still working on bringing nodes back online, but it’s safe to submit jobs now. Submitted jobs will wait in the queue and start running once the nodes are available again.
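As a minimal sketch of that workflow, assuming Torque-style scheduler commands (qstat/qsub) are available on your login node, and using run_job.pbs as a placeholder job script name:

    qstat -u $USER      # list your jobs still known to the scheduler
    qsub run_job.pbs    # re-submit a crashed job; it will wait in the queue until nodes are available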

Please follow our updates on the pace-availability email list and at blog.pace.gatech.edu.

Thank you,
PACE Team

March 21, 2018

[RESOLVED] Continued storage slowness, impacting compute nodes as well

Filed under: Uncategorized — Semir Sarajlic @ 4:05 pm

Update (3/22, 10:00AM): The initial findings point to hardware issues, but we don’t have a conclusive diagnosis yet. The vendor is collecting new logs to better understand the issue. We have fixed some of the issues we found in the network and would like to know whether they have made any difference. If you have opened tickets with us, please give us an update on your current experience, whether it’s better, the same, or worse.

Data is everything when it comes to computing and we certainly understand how these issues can have a big impact on your research progress. We are doing everything we can, with the support of the vendor, to resolve these issues ASAP.

Thank you for your feedback, cooperation and patience.

Update (3/21, 8:00PM): We continue to work with the vendor and have found several issues to fix, but the system is not fully stabilized yet. Please keep an eye on this post for more updates.

Original Post:

The storage slowness issues that were initially reported on headnodes seem to be impacting some of the compute nodes as well. We are actively working to address this issue with some guidance from the vendor.

If your jobs are impacted, please open a ticket with pace-support@oit.gatech.edu and report the job IDs. This will allow us to identify specific nodes that could be contributing to the problem.
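As a hedged sketch, assuming Torque-style scheduler commands are available and using 1234567 as a placeholder job ID, you can gather the information for your ticket like this:

    qstat -u $USER                     # note the IDs of your affected jobs
    qstat -f 1234567 | grep exec_host  # for a job still known to the scheduler, shows the nodes it ran on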

The intermittent nature of the problem is making troubleshooting difficult. We’d appreciate your patience while we are trying to identify the culprit.

Thank you.

 

March 5, 2018

[RESOLVED] PACE login nodes slowness

Filed under: Uncategorized — Semir Sarajlic @ 11:11 pm
As reported by many of our users, we are experiencing storage-related slowness on the majority of login nodes. At this point, we have reason to believe that this is caused by heavy-duty data operations run on the login nodes by several users. We are currently working to pinpoint the processes contributing to the problem and the users running them.
We’d like to once again ask all of our users not to perform any data operations (e.g. SFTP connections, rsync, scp, tar, zip/unzip, etc.) on the login nodes. Instead, please use the data mover machine (iw-dm-4.pace.gatech.edu). This will not only help keep the login nodes responsive, but will also give you significantly faster data transfer performance than the login nodes.
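For example, a transfer through the data mover might look like the following sketch; the username and paths are placeholders, and only the hostname iw-dm-4.pace.gatech.edu comes from this post:

    scp results.tar.gz username@iw-dm-4.pace.gatech.edu:~/data/
    rsync -av output/ username@iw-dm-4.pace.gatech.edu:~/project/output/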
This issue has been recurring for a long while, and PACE has been working on an alternative mechanism to address it permanently. We now have an experimental solution in place and are looking for a small group of volunteers to test it. If you are experiencing slowness on the login nodes and would like to volunteer for some testing, please contact mehmet.belgin@oit.gatech.edu directly.
In the meantime, PACE system engineers will continue to work on this issue and eliminate the slowness as soon as possible.

 
