PACE A Partnership for an Advanced Computing Environment

May 3, 2018

[Resolved] Large Scale Storage Problems

Filed under: Uncategorized — Semir Sarajlic @ 7:46 pm

Current Status (5/3 4:30pm): Storage problems are resolved, all compute nodes are back online, accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.

update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we have been performing with guidance from the vendor, in preparation for the maintenance day. These steps were considered safe and no failures were expected. We are still investigating to find more about which step(s) lead to this cascading failure.

update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.

 

Original Message:

We received reports of the main PACE storage (GPFS) failures around 3:30pm today (5/3, Thr), impacting jobs. We found that this issue applies to all GPFS systems (pace1, pace2, menon1), with a large scale impact PACE-wide.

We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.

We are sorry for this inconvenience and thank you for your patience.

 

 

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress