PACE A Partnership for an Advanced Computing Environment

December 11, 2018

[Resolved] Widespread problems impacting all PACE machines

Filed under: Uncategorized — Semir Sarajlic @ 10:31 pm

Update (12/21, 10:15am): A correction: the problems started this morning around 8:15am, not yesterday evening as previously communicated. The systems were back online at 8:45am.

Update (12/21, 9:15am): Another incident started last night, causing the same symptoms (hanging and unavailability of the scientific repository). OIT storage engineers reverted the services to the redundant system (high availability pair) and the storage is available again. We continue to investigate the root cause of the recurring failures experienced over the past several weeks.

Update (12/12, 6:30pm): The services have been successfully migrated to the high availability pair, and the filesystems are once again accessible. We’ll continue to monitor the systems and take a close look at the errant components. It’s still possible that some of these problems will recur, but we’ll be ready to address them if they do.

Update (12/12, 5:30pm): Unfortunately the problems seem to be coming back. We continue to work on this. Thank you for your patience.

Update (12/12, 11:30am): We identified the root cause as a configuration conflict between two devices and resolved the problem. All systems are back online and available for jobs.

Update (12/12, 10:00am): Our battle with the storage system continues. This filesystem is designed as a high availability service with redundant components to prevent such situations, but unfortunately the second system failed to take over successfully. We are investigating the possibility that the network is the culprit. We continue to work rigorously to bring the systems back online ASAP.

Update (12/11, 9:00pm): The problems continue. We are working on them with support from related OIT units.

Update (12/11, 7:30pm): We mitigated the issue, but intermittent problems may recur until the root cause is addressed. We continue to work on it.

Original message:

Dear PACE Users,

At around 3:45pm on Dec 11, the fileserver that serves the shared “/usr/local” on all PACE machines started experiencing problems. This issue causes several widespread problems, including:

  • Unavailability of the PACE repository (which is in “/usr/local/pacerepov1”)
  • Crashing of newly started jobs that run applications in the PACE repository
  • Hanging of new logins

Running applications that have their executables cached in memory may continue to run without problems, but it’s very difficult to tell exactly how different applications will be impacted.

We are working to resolve these problems ASAP and will keep you updated on this post.




