PACE is experiencing problems after an InfiniBand (IB) network failure, which affects MPI jobs as well as IB-connected storage, including GPFS (project space) and PanFS (scratch space). This failure may have caused running jobs to crash or hang.
The InfiniBand network has been restored, and we are now working to restore the storage mounts. We have also paused job submissions to prevent new jobs from starting. We will allow jobs to start again once the problems are completely resolved.
Thank you for your patience.
PACE team