Update (3/29, 11:00am): We continued to see problems overnight and this morning. It’s important to mention that these back-to-back problems, namely the power loss, network issues, GPFS storage failures, and read-only headnodes, are separate events, although some of them may well be related, with the network being the most likely culprit. We are still investigating with the help of the storage and network teams.
The read-only headnodes are an unfortunate outcome of the VM storage failures. We have restored the system and VM storage and will start rebooting the headnodes shortly. We can’t guarantee that these events will not recur, so frequent headnode reboots and denied logins should be expected while we recover these systems. Please be mindful of these possibilities and save your work frequently, or refrain from using the headnodes for anything other than submitting jobs.
The compute nodes appear to be mostly stable, although we identified several with leftover storage issues.
Update (3/28, 11:30pm): Thanks to prompt feedback from some of our users, we identified a list of headnodes that became read-only because of the storage issues. We have started rebooting them for filesystem checks. This process may take more than an hour to complete.
Update (3/28, 11:00pm): At this point, we have resolved the network issues, restored the storage systems, and brought the compute nodes back online, which have started running jobs.
We believe that the cascading issues were triggered by a network problem. We will continue to monitor the systems and work with the vendor tomorrow to find out more.
Update (3/28, 9:30pm): All network- and storage-related issues have been addressed. We have started bringing nodes back online and running tests to make sure they are healthy and can run jobs.
Original Post:
As several of you have already noticed and reported, PACE’s main storage systems are experiencing problems. The symptoms indicate a wide-scale network event, and we are working with the OIT Network Team to investigate this issue.
This issue may impact jobs, so please refrain from submitting new jobs until all systems and services are stabilized again.
We don’t have an estimated time to resolution yet, but we will continue to update this blog with our progress.