Current Status (5/3 4:30pm): Storage problems are resolved, all compute nodes are back online, accepting jobs. Please resubmit crashed jobs and contact pace-support@oit.gatech.edu if there is anything we can assist with.
update (5/3 4:15pm): We found that the storage failure was caused by a series of tasks we have been performing with guidance from the vendor, in preparation for the maintenance day. These steps were considered safe and no failures were expected. We are still investigating to find more about which step(s) lead to this cascading failure.
update (5/3 4:00pm): All of the compute nodes will appear offline and will not accept jobs until this issue is resolved.
Original Message:
We received reports of the main PACE storage (GPFS) failures around 3:30pm today (5/3, Thr), impacting jobs. We found that this issue applies to all GPFS systems (pace1, pace2, menon1), with a large scale impact PACE-wide.
We are actively working with the vendor to resolve this issue urgently and will continue to update this post as we find more about the root cause.
We are sorry for this inconvenience and thank you for your patience.