PACE: A Partnership for an Advanced Computing Environment

May 17, 2013

PC1 & PB1 filesystems back online

Filed under: News,tech support — Tags: — Semir Sarajlic @ 3:25 am

Hey folks,

It looks like we may have finally found the issue that was tying up the PB1 file server and causing the occasional lockup of the PC1 file server. We’ve isolated the compute nodes that seemed to be generating the bad traffic, and have even isolated the processes that appear to have compounded the problem on a pair of shared nodes (thus linking the two server failures). With any luck, we’ll get those nodes back online once their other jobs complete or are cancelled.

Thank you for your patience while we tracked this problem down. We know it was quite inconvenient, but we now have a decent picture of what occurred, and thankfully it is something that is very unlikely to repeat itself.

May 2, 2013

RESOLVED (again…): PC1 server back online

Filed under: Uncategorized — Semir Sarajlic @ 5:54 am

Hey folks, it’s me again.

As of this post, I have been able to keep the system running for 3 solid hours doing the catch-up backup runs with no issues. The previous announcement and subsequent embarrassment made me wary of announcing this too early again, but I think the system really is stable now, so have at it.

Compute away…

bnm

May 1, 2013

PC1 file server still inaccessible…

Filed under: Uncategorized — Semir Sarajlic @ 11:19 pm

*sigh*

I made sure to let the system get loaded down for a while with the backups and such before I made that announcement, but sure enough, something is still wrong here, as the replacement file server has now crashed.

I'm looking into it now, but I have to suspect that something at the OS level has gone terribly wrong in the past few days.

Sorry folks.

RESOLVED: Hardware Failure for /PC1 filesystem users

Filed under: Uncategorized — Tags: — Semir Sarajlic @ 10:39 pm

Hey folks,

The PC1 file system is now back online. No data was lost (no disks were harmed in the incident), though you probably need to check the status of any running or scheduled jobs.

We had to use a different (though equivalent) system to get you back online. If we need to switch to the actual replacement hardware provided by Penguin, we will do so on the next Maintenance Day (July 16); otherwise, you should be ready to rock and roll.

Sorry about the delays; some of the needed parts were not available.
