PACE A Partnership for an Advanced Computing Environment

May 17, 2013

PC1 & PB1 filesystems back online

Filed under: News,tech support — Tags: — Semir Sarajlic @ 3:25 am

Hey folks,

It looks like we may have finally found the issue that was tying up the PB1 file server and causing the occasional lock-up of the PC1 file server. We’ve isolated the compute nodes that seemed to be generating the bad traffic, and have even isolated the processes that appear to have compounded the problem on a pair of shared nodes (thus linking the two server failures). With any luck, we’ll get those nodes back online once their other jobs complete or are cancelled.

Thank you for your patience while we tracked this problem down. We know it was quite inconvenient, but we now have a decent picture of what occurred, and thankfully it is very unlikely to happen again.

May 1, 2013

RESOLVED: Hardware Failure for /PC1 filesystem users

Filed under: Uncategorized — Tags: — Semir Sarajlic @ 10:39 pm

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. If we need to switch to the actual replacement hardware provided by Penguin, we will do so on the next Maintenance Day (July 16); otherwise, you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.

April 30, 2013

Hardware Failure for /PC1 filesystem users

Filed under: Uncategorized — Tags: — Semir Sarajlic @ 5:52 pm

Hey folks,

The fileserver providing access to the filesystems hosted under (/nv)/pc1 has suffered a severe failure and requires replacement parts before we can bring it online again. We are in contact with the vendor to resolve this as quickly as possible.


October 5, 2012

Cygnus FS pc5 online…mostly.

Filed under: tech support — Tags: — Semir Sarajlic @ 8:38 pm

We have been able to bring /nv/pc5 back online, but at a cost to redundancy. One of the network interfaces/cables/switches is not behaving, but when we tried disconnecting various combinations of cables, we found one that caused the filesystem to be immediately available to all nodes.

Considering how close maintenance day is (10/16/12), spending time isolating the cable/switch/interface problem now would only mean more time with this filesystem offline while equipment gets retested. Waiting until maintenance day will cause the least disruption for Cygnus pc5 users who still have a last run of jobs to complete, and it takes some time pressure off of us so we can make sure the issue is resolved in its entirety before bringing all resources back online.

Despite the loss of redundancy, functionality is NOT affected. Functionality will be impacted only if an additional switch or cable fails between now and October 16.

Cygnus File System pc5 offline

Filed under: tech support — Tags: — Semir Sarajlic @ 8:01 pm

It appears that we have an issue with the server housing the /nv/pc5 filesystem, which serves a subset of the Cygnus cluster users. We’re trying to isolate the source of the problem, but we have yet to find a pattern explaining why the filesystem is available on some nodes and not on others.

September 29, 2012

Joe Cluster Status

Filed under: tech support — Tags: — Semir Sarajlic @ 8:08 pm

Between about 8:00 and 8:30pm on September 28, 2012, a power event took down the TSRB data center, knocking a significant fraction of the Joe cluster offline.

With assistance from Operations, we are now bringing these nodes online after determining that several of the management switches for these nodes did not recover gracefully from the event. Because these switches control our ability to manage the nodes, we had to wait until they were available before bringing nodes online, which we are doing now, at about 4pm on September 29, 2012.

Jobs that were running on these nodes (iw-a2-* and iw-a3-*) at the time of the outage may have terminated abnormally. Jobs scheduled but not running should be fine.

UPDATE @ 4:40pm, 2012-09-29: All nodes are online.

September 13, 2012

Joe file server back online

Filed under: tech support — Tags: — Semir Sarajlic @ 7:17 pm

After working with the network team, we appear to have stabilized the networking for the file server. We apologize for the inconvenience.

Joe file server still having difficulties

Filed under: tech support — Tags: — Semir Sarajlic @ 1:48 pm

The network interfaces on the file server providing service to Joe cluster are currently having problems determining which is up and which is down. This started around 4:30am, and we are engaging the network team to isolate the problem to the machine, cables, or switches.

September 12, 2012

Joe Fileserver fixed

Filed under: tech support — Tags: — Semir Sarajlic @ 8:51 pm

The fileserver that houses Joe users’ data (hp3/pj1) started acting squirrelly this morning, finding itself unable to connect to the PACE LDAP server. That, in turn, caused Joe users to have problems logging in or to have their jobs hang, because the fileserver could not authenticate users/jobs.

Restarting all the services on the fileserver rectified the problem.

November 14, 2011

Cygnus nodes back online

Filed under: tech support — Tags: — Semir Sarajlic @ 4:09 pm

The storage problem has been fixed, and the nodes are available for use. Thanks for your patience.
